Part 1 - Risk Model Challenge Analysis


Prepared by: Nahim Olguin

Table of Contents¶

  1. Business Objective
    • Basic Statistics
    • Merging Datasets
  2. Exploratory Data Analysis (EDA)
    • Univariate Analysis
    • Bivariate/Multivariate Analysis
  3. Data Preparation and Feature Engineering
    • Feature Engineering
    • Encoding
  4. K-means Clustering
  5. RFM Analysis
  6. Data Pipeline Transformation

  7. Model Development

    • Trainig and Test Oversampling SMOTE
    • SMOTE
    • Model Selection Hyperparameters
  8. Model Evaluation
  9. Model Interpretation
    • Feature Importance
  10. Results and Recommendations
In [5]:
%%html
<style>
    .qst {padding:10px; border-radius: 5px; border: solid 2px #EF6E1A;}
    .qst:before {font-weight: bold;display: block; margin: 0px 10px 10px 10px;}
    h1,h3 {color: #EF6E1A;}
</style>

Business Objective¶

To develop a predictive model that effectively differentiates between good and risky customers for Bankaya, thereby aiding in more accurate credit approval decisions. The model will specifically focus on new customers applying for their first loan to purchase a smartphone. Each step in the development process must be rigorously justified and documented to ensure both efficacy and transparency.

You can find the repository in Github in the following link: https://github.com/nahim21/Bankaya_risk_model


In [6]:
#!pip install imbalanced-learn
#!pip install mlxtend

# -- Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as pyo
import json
import sys

pyo.init_notebook_mode(connected=True)
#pio.renderers.default = "notebook"
pd.set_option('display.max_columns', None)

# ML functions
from ML_functions.plots import plot_histogram
from ML_functions.util_functions import plot_diagnostic_outlier,select_diagnostic_outliers
from ML_functions.plots import plot_pie, plot_time_series, matrix_correlation

# --- Model functions

from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import category_encoders as ce
import warnings
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Performance Metrics

from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.metrics import confusion_matrix,classification_report
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt

# ======== LIBRERIAS MODELLING =======
from sklearn.linear_model import LogisticRegression
import  xgboost as xgb
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

warnings.filterwarnings("ignore")
In [7]:
# --- Load the datasets
customer_loan_df = pd.read_parquet("input/main_dataset.parquet") #---> main dataset
credit_report_df = pd.read_parquet("input/credit_reports.parquet") # ---> credit reports
In [8]:
# --- Examine the first columns of the customer dataset
print(f"The shape of the dataset is: {customer_loan_df.shape}")
customer_loan_df.head(7)
The shape of the dataset is: (14454, 17)
Out[8]:
customer_id loan_id ACC_CREATION_DATETIME APPLICATION_DATETIME LOAN_ORIGINATION_DATETIME max_days_late target account_to_application_days n_sf_apps first_app_date last_app_date n_bnpl_apps n_bnpl_approved_apps first_bnpl_app_date last_bnpl_app_date n_inquiries_l3m n_inquiries_l6m
0 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0
1 5190 2 2022-04-26 09:57:25 2022-04-26 07:00:00 2022-12-22 18:03:32 2 0 0 NaN NaT NaT NaN NaN NaT NaT NaN NaN
2 5194 3 2022-04-26 12:22:35 2022-04-26 07:00:00 2022-08-09 17:31:05 4 0 0 NaN NaT NaT NaN NaN NaT NaT NaN NaN
3 3978 4 2022-03-09 11:26:55 2022-04-26 07:00:00 2022-11-11 19:48:08 2 0 47 1.0 2022-03-17 2022-03-17 2.0 0.0 2022-03-19 22:02:50.555 2022-03-09 16:41:49.675 13.0 13.0
4 4535 5 2022-04-01 14:28:42 2022-04-26 07:00:00 2022-07-24 12:07:15 2 0 24 1.0 2022-04-01 2022-04-01 1.0 0.0 2022-04-01 19:37:56.185 2022-04-01 19:37:56.185 21.0 21.0
5 3604 6 2022-02-21 11:55:32 2022-05-05 07:00:00 2022-12-08 07:12:18 7 0 72 1.0 2022-02-21 2022-02-21 NaN NaN NaT NaT 26.0 26.0
6 271 7 2021-04-06 16:30:21 2022-05-05 07:00:00 2023-04-21 15:40:10 0 0 393 NaN NaT NaT 1.0 0.0 2022-05-04 23:18:03.604 2022-05-04 23:18:03.604 11.0 11.0
In [9]:
# --- Examine the first columns of the customer dataset
print(f"The shape of the dataset is: {credit_report_df.shape}")
credit_report_df.head(7)
The shape of the dataset is: (287356, 29)
Out[9]:
customer_id INQUIRY_TIME CDC_INQUIRY_ID INQUIRY_DATE PREVENTION_KEY CURRENCY MAX_CREDIT CREDIT_LIMIT PAYMENT_AMOUNT UPDATE_DATE LOAN_OPENING_DATE LOAN_CLOSING_DATE WORST_DELAY_DATE REPORT_DATE LAST_PURCHASE_DATE LAST_PAYMENT_DATE PAYMENT_FREQUENCY BUSINESS_TYPE CREDIT_TYPE ACCOUNT_TYPE RESPONSABILITY_TYPE TOTAL_PAYMENTS DELAYED_PAYMENTS CURRENT_PAYMENT WORST_DELAY TOTAL_REPORTED_PAYMENTS CURRENT_BALANCE BALANCE_DUE BALANCE_DUE_WORST_DELAY
0 4223 2022-04-01 00:32:36.689000+00:00 710278-27993a6e-2885-48d4-a753-9249e7022af1 2022-04-01 SIN OBSERVACION PESOS MEXICANOS 1244.0 9900.0 1244.0 2022-02-28 2019-08-30 NaT 2022-01-31 2022-02-28 2019-08-30 2020-08-18 MENSUAL TIENDA DEPARTAMENTAL APARATOS/MUEBLES PAGOS FIJOS INDIVIDUAL (TITULAR) 12.0 7 13 13.0 0 1244.0 1244.0 1244.0
1 4223 2022-04-01 00:32:36.689000+00:00 710278-27993a6e-2885-48d4-a753-9249e7022af1 2022-04-01 SIN OBSERVACION PESOS MEXICANOS 1722.0 9900.0 1722.0 2022-02-28 2020-04-21 NaT 2021-12-31 2022-02-28 2020-04-21 2020-08-18 PAGO MINIMO PARA CUENTAS REVOLVENTES TIENDA DEPARTAMENTAL LINEA DE CREDITO REVOLVENTE INDIVIDUAL (TITULAR) 0.0 3 13 13.0 0 1722.0 1722.0 1722.0
2 4223 2022-04-01 00:32:36.689000+00:00 710278-27993a6e-2885-48d4-a753-9249e7022af1 2022-04-01 SIN OBSERVACION PESOS MEXICANOS 6346.0 0.0 482.0 2022-03-28 2021-12-17 NaT NaT 2022-03-28 2021-12-17 2022-03-26 SEMANAL BANCOS PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 16.0 0 V 0.0 0 943.0 0.0 0.0
3 3490 2022-02-15 02:30:22.086000+00:00 622857-6b4e9d95-7491-40c3-bccd-442de7f94c58 2022-02-15 SIN OBSERVACION PESOS MEXICANOS 11600.0 0.0 232.0 2022-02-01 2021-03-02 NaT 2022-01-04 2022-02-01 2021-03-02 2022-01-25 SEMANAL TIENDA DE ROPA PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 100.0 4 01 1.0 0 6185.0 116.0 116.0
4 6486 2022-06-25 01:57:14.868000+00:00 875073-46a5f149-19db-4193-b92a-b8f41fb3e896 2022-06-25 SIN OBSERVACION PESOS MEXICANOS 2452.0 16800.0 2452.0 2021-09-30 2019-02-22 NaT 2020-08-31 2021-09-30 2019-04-16 2019-06-12 PAGO MINIMO PARA CUENTAS REVOLVENTES TIENDA DEPARTAMENTAL LINEA DE CREDITO REVOLVENTE INDIVIDUAL (TITULAR) 0.0 19 13 13.0 0 2452.0 2452.0 2452.0
5 6486 2022-06-25 01:57:14.868000+00:00 875073-46a5f149-19db-4193-b92a-b8f41fb3e896 2022-06-25 SIN OBSERVACION PESOS MEXICANOS 0.0 0.0 0.0 2022-05-31 2021-11-30 NaT NaT 2022-05-31 2022-05-16 2022-05-16 PAGO MINIMO PARA CUENTAS REVOLVENTES SERVICIO DE TELEVISION DE PAGA OTROS SIN LIMITE PREESTABLECIDO INDIVIDUAL (TITULAR) 0.0 0 V 0.0 0 0.0 0.0 0.0
6 4075 2022-03-13 01:46:41.593000+00:00 693356-d32e8877-2ba0-48c0-8366-08d733aa974e 2022-03-13 SIN OBSERVACION PESOS MEXICANOS 7783.0 0.0 0.0 2004-09-23 2003-08-05 2004-04-27 NaT 2004-09-23 2003-08-05 2004-04-27 SEMANAL MERCANCIA PARA HOGAR Y OFICINA CREDITO AL CONSUMO PAGOS FIJOS INDIVIDUAL (TITULAR) 39.0 0 V 0.0 0 0.0 0.0 0.0
In [10]:
# --- Examine the basic statistics of the customer_loan_df
customer_loan_df.describe().T
Out[10]:
count mean min 25% 50% 75% max std
customer_id 14454.0 7227.5 1.0 3614.25 7227.5 10840.75 14454.0 4172.654731
loan_id 14454.0 7227.5 1.0 3614.25 7227.5 10840.75 14454.0 4172.654731
ACC_CREATION_DATETIME 14454 2022-06-17 07:24:49.443337216 2020-10-14 18:22:10 2022-02-21 18:46:22.249999872 2022-07-19 20:29:43.500000 2022-11-13 07:37:39.249999872 2023-05-19 19:55:04 NaN
APPLICATION_DATETIME 14454 2022-11-28 03:42:40.896637696 2022-04-26 07:00:00 2022-09-15 13:00:00 2022-12-20 08:00:00 2023-02-04 08:00:00 2023-05-26 07:00:00 NaN
LOAN_ORIGINATION_DATETIME 14454 2022-12-28 06:04:09.504220416 2022-07-01 09:03:20 2022-10-27 21:15:58.249999872 2023-01-11 10:05:49.500000 2023-03-06 18:07:46.249999872 2023-05-29 12:18:28 NaN
max_days_late 14454.0 14.225889 -7.0 0.0 2.0 20.0 70.0 21.738445
target 14454.0 0.1868 0.0 0.0 0.0 0.0 1.0 0.389764
account_to_application_days 14454.0 163.489 0.0 0.0 103.0 271.75 901.0 181.110989
n_sf_apps 6806.0 1.653982 1.0 1.0 1.0 2.0 42.0 1.697131
first_app_date 6806 2022-06-16 04:31:39.970614272 2021-04-27 00:00:00 2022-02-27 00:00:00 2022-07-15 00:00:00 2022-10-21 00:00:00 2023-05-12 00:00:00 NaN
last_app_date 6806 2022-06-16 01:42:11.531002112 2021-04-25 00:00:00 2022-02-25 00:00:00 2022-07-16 00:00:00 2022-10-22 00:00:00 2023-05-12 00:00:00 NaN
n_bnpl_apps 8739.0 1.221765 1.0 1.0 1.0 1.0 18.0 0.831144
n_bnpl_approved_apps 8739.0 0.264904 0.0 0.0 0.0 0.0 15.0 0.602481
first_bnpl_app_date 8739 2022-08-13 09:30:46.852090880 2022-01-06 21:17:08.193000 2022-05-01 21:03:56.963500032 2022-08-18 13:36:14.271000064 2022-11-06 19:24:55.189499904 2023-05-20 17:15:47 NaN
last_bnpl_app_date 8739 2022-08-03 09:11:07.589265408 2022-01-06 21:17:08.193000 2022-04-20 05:33:33.585999872 2022-07-28 17:37:41.676999936 2022-11-06 01:50:47.641999872 2023-05-17 15:20:48 NaN
n_inquiries_l3m 9083.0 10.350435 0.0 0.0 0.0 14.0 170.0 19.694595
n_inquiries_l6m 9083.0 17.11483 0.0 0.0 8.0 26.0 213.0 23.229088

Loan information

  • The average time a loan was late is around 14.2 days, with a maximum of 70 days late and a minimum of 7 days early.

Loan Creation and Application Dates

  • Most loan applications were made around November 28, 2022.
  • Loans were usually originated around December 28, 2022.

BNPL (Buy Now Pay Later) Applications:

  • On average, customers applied for BNPL 1.22 times, with some customers applying up to 18 times.
  • About 26.5% of BNPL applications were approved, with a maximum of 15 applications approved for a single customer.

SF (Smartphone Finance) Applications:

  • On average, customers applied for SF 1.65 times.

Credit Bureau Inquiries

  • On average, there were 10.35 inquiries in the last 3 months and 17.11 inquiries in the last 6 months.
  • The maximum inquiries made in the last 3 months and 6 months were 170 and 213, respectively.

Basic statistics¶

In [11]:
# --- Examine the basic statistics of the customer_loan_df
credit_report_df.describe().T
Out[11]:
count mean min 25% 50% 75% max std
customer_id 287356.0 6333.768075 1.0 2776.0 6187.0 9760.0 14416.0 3989.039684
INQUIRY_DATE 287267 2022-07-02 00:44:25.383771392 2021-04-29 00:00:00 2022-03-22 00:00:00 2022-07-18 00:00:00 2022-10-27 00:00:00 2023-05-17 00:00:00 NaN
MAX_CREDIT 287174.0 15280.383928 0.0 548.0 2300.0 7483.0 404040416.0 1313088.110101
CREDIT_LIMIT 278999.0 5525.841766 0.0 0.0 400.0 3000.0 1900000.0 27882.692051
PAYMENT_AMOUNT 287267.0 1670.948417 0.0 0.0 0.0 149.0 1800000.0 13010.321018
UPDATE_DATE 287267 2020-03-26 04:35:03.002433024 2001-04-25 00:00:00 2019-03-13 00:00:00 2021-11-22 00:00:00 2022-06-15 00:00:00 2023-05-07 00:00:00 NaN
LOAN_OPENING_DATE 287267 2018-11-22 15:24:26.870889984 1950-01-01 00:00:00 2017-03-21 00:00:00 2020-04-02 00:00:00 2021-10-14 00:00:00 2023-03-31 00:00:00 NaN
LOAN_CLOSING_DATE 192609 2018-10-24 21:45:09.910752 2000-08-28 00:00:00 2017-01-30 00:00:00 2020-01-27 00:00:00 2021-09-20 00:00:00 2023-03-31 00:00:00 NaN
WORST_DELAY_DATE 84658 2019-08-11 02:14:42.969123072 1998-10-10 00:00:00 2018-04-30 00:00:00 2021-05-31 00:00:00 2022-04-20 00:00:00 2023-04-12 00:00:00 NaN
REPORT_DATE 287267 2020-03-22 05:43:32.962853376 2001-04-25 00:00:00 2019-02-28 00:00:00 2021-11-10 00:00:00 2022-06-12 00:00:00 2023-05-07 00:00:00 NaN
LAST_PURCHASE_DATE 283214 2019-02-17 08:26:14.566228992 1950-01-01 00:00:00 2017-07-29 00:00:00 2020-08-03 00:00:00 2021-11-16 00:00:00 2023-04-07 00:00:00 NaN
LAST_PAYMENT_DATE 255679 2019-04-09 00:51:00.575174400 1950-01-01 00:00:00 2017-09-02 00:00:00 2020-09-06 00:00:00 2021-12-27 00:00:00 2023-05-07 00:00:00 NaN
TOTAL_PAYMENTS 268711.0 22.773876 0.0 1.0 3.0 16.0 1800.0 89.005749
WORST_DELAY 284146.0 4.27879 0.0 0.0 0.0 2.0 84.0 12.231546
CURRENT_BALANCE 287267.0 4578.482927 0.0 0.0 0.0 273.0 3469743.0 35543.251719
BALANCE_DUE 287267.0 2090.320639 0.0 0.0 0.0 0.0 1800000.0 14111.862995
BALANCE_DUE_WORST_DELAY 287267.0 1671.461922 0.0 0.0 0.0 159.0 1800000.0 11646.107604

Credit Limits & Usage

  • Notably, 25% of the records have a credit limit of zero, implying some customers might have credit records without an active credit line.
  • On average, users have a credit limit of 5,000 units. Understanding this average credit limit can assist us in creating more precise user clusters based on their financial capabilities.

Payments

  • 50% of the customers have not made any payment (median is zero), which might indicate that they either have no outstanding dues or have not started repayment.

Delays

  • On average, customers have a worst delay of 4.28 days in their payment history. However, some customers have delayed as much as 84 days.
  • Considering 50% of the records show no delay (median is zero), it implies that a significant portion of customers make timely payments, but there's a minority that skews the average with longer delays.

Potential Data Quality Issues:

  • Several columns, especially date columns, have NaN values for their standard deviation. This might suggest that there are either data quality issues or that the data is uniform for these columns.

Merging Datasets¶

Given that we're interested in a comprehensive view of each customer's credit behavior, an Inner Join would be the most straightforward and effective. This ensures that we only consider customers for whom we have completed information from both internal and external sources.

In [12]:
# --- Merge the datasets using the customer_id column
bankaya_merged = pd.merge(customer_loan_df,credit_report_df, on='customer_id',how='inner')

# --- Examine the first columns of the new dataset
print(f"The shape of the merged dataset is: {bankaya_merged.shape}")
print(f'previous customers amount: {len(customer_loan_df.customer_id.unique())}')
print(f'current customers amount: {len(bankaya_merged.customer_id.unique())}')
bankaya_merged.head(5)
The shape of the merged dataset is: (287356, 45)
previous customers amount: 14454
current customers amount: 9249
Out[12]:
customer_id loan_id ACC_CREATION_DATETIME APPLICATION_DATETIME LOAN_ORIGINATION_DATETIME max_days_late target account_to_application_days n_sf_apps first_app_date last_app_date n_bnpl_apps n_bnpl_approved_apps first_bnpl_app_date last_bnpl_app_date n_inquiries_l3m n_inquiries_l6m INQUIRY_TIME CDC_INQUIRY_ID INQUIRY_DATE PREVENTION_KEY CURRENCY MAX_CREDIT CREDIT_LIMIT PAYMENT_AMOUNT UPDATE_DATE LOAN_OPENING_DATE LOAN_CLOSING_DATE WORST_DELAY_DATE REPORT_DATE LAST_PURCHASE_DATE LAST_PAYMENT_DATE PAYMENT_FREQUENCY BUSINESS_TYPE CREDIT_TYPE ACCOUNT_TYPE RESPONSABILITY_TYPE TOTAL_PAYMENTS DELAYED_PAYMENTS CURRENT_PAYMENT WORST_DELAY TOTAL_REPORTED_PAYMENTS CURRENT_BALANCE BALANCE_DUE BALANCE_DUE_WORST_DELAY
0 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0 2022-04-25 21:58:16.422000+00:00 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a 2022-04-25 SIN OBSERVACION PESOS MEXICANOS 3000.0 0.0 0.0 2017-04-19 2013-03-27 2017-02-21 NaT 2017-02-28 2017-02-21 2013-03-27 MENSUAL SERVICIOS LINEA DE CREDITO SIN LIMITE PREESTABLECIDO INDIVIDUAL (TITULAR) NaN 0 V NaN None 0.0 0.0 0.0
1 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0 2022-04-25 21:58:16.422000+00:00 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a 2022-04-25 SIN OBSERVACION PESOS MEXICANOS 500.0 2200.0 0.0 2021-06-21 2021-04-03 2021-04-18 NaT 2021-06-21 2021-04-03 2021-04-18 MENSUAL COMPANIA DE PRESTAMO PERSONAL PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 1.0 0 V 0.0 0 0.0 0.0 0.0
2 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0 2022-04-25 21:58:16.422000+00:00 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a 2022-04-25 SIN OBSERVACION PESOS MEXICANOS 1100.0 2200.0 0.0 2021-06-21 2021-04-18 2021-04-22 NaT 2021-06-21 2021-04-18 2021-04-22 MENSUAL COMPANIA DE PRESTAMO PERSONAL PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 1.0 0 V 0.0 0 0.0 0.0 0.0
3 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0 2022-04-25 21:58:16.422000+00:00 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a 2022-04-25 SIN OBSERVACION PESOS MEXICANOS 1700.0 2500.0 0.0 2021-07-13 2021-04-22 2021-05-03 NaT 2021-07-13 2021-04-22 2021-05-03 MENSUAL COMPANIA DE PRESTAMO PERSONAL PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 1.0 0 V 0.0 0 0.0 0.0 0.0
4 1223 1 2021-08-23 13:57:56 2022-04-26 07:00:00 2022-10-20 18:26:35 5 0 245 1.0 2022-04-25 2022-04-25 1.0 0.0 2022-04-26 02:15:17.742 2022-04-26 02:15:17.742 91.0 91.0 2022-04-25 21:58:16.422000+00:00 221105-252f8dc6-04b8-41dc-92c3-bd9902478b8a 2022-04-25 SIN OBSERVACION PESOS MEXICANOS 1800.0 2500.0 0.0 2021-07-13 2021-05-03 2021-05-26 NaT 2021-07-13 2021-05-03 2021-05-26 MENSUAL COMPANIA DE PRESTAMO PERSONAL PRESTAMO PERSONAL PAGOS FIJOS INDIVIDUAL (TITULAR) 1.0 0 V 0.0 0 0.0 0.0 0.0
In [13]:
# --- Examine the basic statistics of the merged dataframe
bankaya_merged.describe().T
Out[13]:
count mean min 25% 50% 75% max std
customer_id 287356.0 6333.768075 1.0 2776.0 6187.0 9760.0 14416.0 3989.039684
loan_id 287356.0 7695.095067 1.0 4697.0 7871.0 10909.0 14454.0 4033.276971
ACC_CREATION_DATETIME 287356 2022-05-06 23:16:59.809368576 2020-10-14 18:22:10 2021-12-25 14:08:44 2022-06-16 14:19:46 2022-10-05 23:13:30 2023-04-24 18:36:17 NaN
APPLICATION_DATETIME 287356 2022-12-09 03:45:22.400088832 2022-04-26 07:00:00 2022-10-22 07:00:00 2022-12-25 08:00:00 2023-02-05 08:00:00 2023-05-26 07:00:00 NaN
LOAN_ORIGINATION_DATETIME 287356 2023-01-11 20:22:31.556476928 2022-07-01 09:03:20 2022-12-03 15:19:37 2023-01-25 19:02:11 2023-03-11 11:12:19 2023-05-29 12:18:28 NaN
max_days_late 287356.0 12.781772 -7.0 0.0 2.0 15.0 70.0 20.510928
target 287356.0 0.167513 0.0 0.0 0.0 0.0 1.0 0.373434
account_to_application_days 287356.0 215.583043 0.0 67.0 173.0 324.0 901.0 180.294859
n_sf_apps 202674.0 1.649338 1.0 1.0 1.0 2.0 36.0 1.563695
first_app_date 202674 2022-06-24 06:29:30.638561792 2021-04-27 00:00:00 2022-03-11 00:00:00 2022-07-23 00:00:00 2022-10-24 00:00:00 2023-04-25 00:00:00 NaN
last_app_date 202674 2022-06-25 02:15:26.990141952 2021-04-28 00:00:00 2022-03-11 00:00:00 2022-07-26 00:00:00 2022-10-25 00:00:00 2023-04-25 00:00:00 NaN
n_bnpl_apps 267144.0 1.243479 1.0 1.0 1.0 1.0 18.0 0.863805
n_bnpl_approved_apps 267144.0 0.275219 0.0 0.0 0.0 0.0 15.0 0.601572
first_bnpl_app_date 267144 2022-08-20 00:47:41.840025088 2022-01-06 21:17:08.193000 2022-05-05 17:59:36.776999936 2022-09-02 23:50:13.225999872 2022-11-11 19:36:34.294000128 2023-05-20 17:15:47 NaN
last_bnpl_app_date 267144 2022-08-09 16:56:22.504993792 2022-01-06 21:17:08.193000 2022-04-22 00:44:14.604999936 2022-08-11 09:51:12.441999872 2022-11-11 03:03:34.395000064 2023-05-17 15:20:48 NaN
n_inquiries_l3m 285656.0 14.919267 0.0 0.0 0.0 24.0 170.0 25.800862
n_inquiries_l6m 285656.0 24.228747 0.0 0.0 15.0 39.0 213.0 29.497535
INQUIRY_DATE 287267 2022-07-02 00:44:25.383772416 2021-04-29 00:00:00 2022-03-22 00:00:00 2022-07-18 00:00:00 2022-10-27 00:00:00 2023-05-17 00:00:00 NaN
MAX_CREDIT 287174.0 15280.383928 0.0 548.0 2300.0 7483.0 404040416.0 1313088.110101
CREDIT_LIMIT 278999.0 5525.841766 0.0 0.0 400.0 3000.0 1900000.0 27882.692051
PAYMENT_AMOUNT 287267.0 1670.948417 0.0 0.0 0.0 149.0 1800000.0 13010.321018
UPDATE_DATE 287267 2020-03-26 04:35:03.002433536 2001-04-25 00:00:00 2019-03-13 00:00:00 2021-11-22 00:00:00 2022-06-15 00:00:00 2023-05-07 00:00:00 NaN
LOAN_OPENING_DATE 287267 2018-11-22 15:24:26.870890496 1950-01-01 00:00:00 2017-03-21 00:00:00 2020-04-02 00:00:00 2021-10-14 00:00:00 2023-03-31 00:00:00 NaN
LOAN_CLOSING_DATE 192609 2018-10-24 21:45:09.910751744 2000-08-28 00:00:00 2017-01-30 00:00:00 2020-01-27 00:00:00 2021-09-20 00:00:00 2023-03-31 00:00:00 NaN
WORST_DELAY_DATE 84658 2019-08-11 02:14:42.969123072 1998-10-10 00:00:00 2018-04-30 00:00:00 2021-05-31 00:00:00 2022-04-20 00:00:00 2023-04-12 00:00:00 NaN
REPORT_DATE 287267 2020-03-22 05:43:32.962853120 2001-04-25 00:00:00 2019-02-28 00:00:00 2021-11-10 00:00:00 2022-06-12 00:00:00 2023-05-07 00:00:00 NaN
LAST_PURCHASE_DATE 283214 2019-02-17 08:26:14.566229248 1950-01-01 00:00:00 2017-07-29 00:00:00 2020-08-03 00:00:00 2021-11-16 00:00:00 2023-04-07 00:00:00 NaN
LAST_PAYMENT_DATE 255679 2019-04-09 00:51:00.575174144 1950-01-01 00:00:00 2017-09-02 00:00:00 2020-09-06 00:00:00 2021-12-27 00:00:00 2023-05-07 00:00:00 NaN
TOTAL_PAYMENTS 268711.0 22.773876 0.0 1.0 3.0 16.0 1800.0 89.005749
WORST_DELAY 284146.0 4.27879 0.0 0.0 0.0 2.0 84.0 12.231546
CURRENT_BALANCE 287267.0 4578.482927 0.0 0.0 0.0 273.0 3469743.0 35543.251719
BALANCE_DUE 287267.0 2090.320639 0.0 0.0 0.0 0.0 1800000.0 14111.862995
BALANCE_DUE_WORST_DELAY 287267.0 1671.461922 0.0 0.0 0.0 159.0 1800000.0 11646.107604

1. Customer Behavior and Risk

Max Days Late:

  • The average max days late is around 12.78 days. However, the 75th percentile is at 15 days, and the maximum is at 70 days.

  • There's a segment of customers who are significantly late in making their payments. Identify the characteristics of these high-risk customers for targeted interventions.

Inquiries Last 3 Months (n_inquiries_l3m):

  • The mean number of inquiries in the last 3 months is approximately 14.92, but the max value is 170.
  • High number of inquiries could be an indicator of credit-seeking behavior and could be risky. Monitor the customers with high inquiries closely.

Total Payments (TOTAL_PAYMENTS):

  • The average number of total payments made is around 22.77, with a maximum of 1800.
  • A high number of payments could either indicate long-term, reliable customers or a fragmented payment behavior. Analyze this in conjunction with payment delays to better understand customer reliability.

Current Balance (CURRENT_BALANCE):

  • The mean current balance is approximately 4578.48, with a maximum going up to 3,469,743.
  • Some accounts have an unusually high balance. Investigate these outliers to ensure they are not errors or potential bad debts.

2. Portofolio Healthk

Account to Application Days (account_to_application_days):

  • The average time between account creation and application is 215.58 days.
  • Customers generally take about 7 months from account creation to loan application. Marketing can potentially target these customers around the 6-month mark for new loan products.

Payment Amount (PAYMENT_AMOUNT):

  • The mean payment amount is around 1670.95, with a maximum value of 1,800,000.
  • Like credit limits, payment amounts vary widely. Ensure that these large payment amounts are not errors and validate the creditworthiness of such customers.
In [14]:
from ML_functions.util_functions import inspection
In [15]:
# --- Inspect the Dataframe in order to determine cardinality, missing values, encoding strategy, imputing and transforming variables. 
inspection(bankaya_merged)
==== Number of labeles: Cardinality ====

ACCOUNT_TYPE contains 7 labels
ACC_CREATION_DATETIME contains 9236 labels
APPLICATION_DATETIME contains 368 labels
BUSINESS_TYPE contains 59 labels
CDC_INQUIRY_ID contains 9173 labels
CREDIT_TYPE contains 34 labels
CURRENCY contains 4 labels
CURRENT_PAYMENT contains 87 labels
DELAYED_PAYMENTS contains 98 labels
INQUIRY_DATE contains 626 labels
INQUIRY_TIME contains 9261 labels
LAST_PAYMENT_DATE contains 7269 labels
LAST_PURCHASE_DATE contains 7711 labels
LOAN_CLOSING_DATE contains 7168 labels
LOAN_OPENING_DATE contains 7944 labels
LOAN_ORIGINATION_DATETIME contains 9005 labels
PAYMENT_FREQUENCY contains 12 labels
PREVENTION_KEY contains 32 labels
REPORT_DATE contains 5814 labels
RESPONSABILITY_TYPE contains 6 labels
TOTAL_REPORTED_PAYMENTS contains 2 labels
UPDATE_DATE contains 5822 labels
WORST_DELAY_DATE contains 5692 labels
first_app_date contains 682 labels
first_bnpl_app_date contains 8507 labels
last_app_date contains 684 labels
last_bnpl_app_date contains 8507 labels
 
==== Types of the variables we are working with ====

customer_id                                  int64
loan_id                                      int64
ACC_CREATION_DATETIME               datetime64[ns]
APPLICATION_DATETIME                datetime64[ns]
LOAN_ORIGINATION_DATETIME           datetime64[ns]
max_days_late                                int64
target                                       int64
account_to_application_days                  int64
n_sf_apps                                  float64
first_app_date                      datetime64[ns]
last_app_date                       datetime64[ns]
n_bnpl_apps                                float64
n_bnpl_approved_apps                       float64
first_bnpl_app_date                 datetime64[ns]
last_bnpl_app_date                  datetime64[ns]
n_inquiries_l3m                            float64
n_inquiries_l6m                            float64
INQUIRY_TIME                   datetime64[ns, UTC]
CDC_INQUIRY_ID                              object
INQUIRY_DATE                        datetime64[ns]
PREVENTION_KEY                              object
CURRENCY                                    object
MAX_CREDIT                                 float64
CREDIT_LIMIT                               float64
PAYMENT_AMOUNT                             float64
UPDATE_DATE                         datetime64[ns]
LOAN_OPENING_DATE                   datetime64[ns]
LOAN_CLOSING_DATE                   datetime64[ns]
WORST_DELAY_DATE                    datetime64[ns]
REPORT_DATE                         datetime64[ns]
LAST_PURCHASE_DATE                  datetime64[ns]
LAST_PAYMENT_DATE                   datetime64[ns]
PAYMENT_FREQUENCY                           object
BUSINESS_TYPE                               object
CREDIT_TYPE                                 object
ACCOUNT_TYPE                                object
RESPONSABILITY_TYPE                         object
TOTAL_PAYMENTS                             float64
DELAYED_PAYMENTS                            object
CURRENT_PAYMENT                             object
WORST_DELAY                                float64
TOTAL_REPORTED_PAYMENTS                     object
CURRENT_BALANCE                            float64
BALANCE_DUE                                float64
BALANCE_DUE_WORST_DELAY                    float64
dtype: object 


Total Samples with missing values: 267458

====Total percentage of missing values per variable ====

WORST_DELAY_DATE               70.538983
LOAN_CLOSING_DATE              32.971993
last_app_date                  29.469369
n_sf_apps                      29.469369
first_app_date                 29.469369
TOTAL_REPORTED_PAYMENTS        14.595484
LAST_PAYMENT_DATE              11.023608
first_bnpl_app_date             7.033784
n_bnpl_approved_apps            7.033784
n_bnpl_apps                     7.033784
last_bnpl_app_date              7.033784
TOTAL_PAYMENTS                  6.488467
CREDIT_LIMIT                    2.908239
LAST_PURCHASE_DATE              1.441418
WORST_DELAY                     1.117081
n_inquiries_l3m                 0.591601
n_inquiries_l6m                 0.591601
MAX_CREDIT                      0.063336
PAYMENT_FREQUENCY               0.030972
REPORT_DATE                     0.030972
BALANCE_DUE                     0.030972
DELAYED_PAYMENTS                0.030972
BUSINESS_TYPE                   0.030972
CREDIT_TYPE                     0.030972
CURRENT_PAYMENT                 0.030972
RESPONSABILITY_TYPE             0.030972
CURRENT_BALANCE                 0.030972
ACCOUNT_TYPE                    0.030972
BALANCE_DUE_WORST_DELAY         0.030972
LOAN_OPENING_DATE               0.030972
UPDATE_DATE                     0.030972
PAYMENT_AMOUNT                  0.030972
CURRENCY                        0.030972
PREVENTION_KEY                  0.030972
INQUIRY_DATE                    0.030972
CDC_INQUIRY_ID                  0.030972
loan_id                         0.000000
INQUIRY_TIME                    0.000000
account_to_application_days     0.000000
target                          0.000000
max_days_late                   0.000000
LOAN_ORIGINATION_DATETIME       0.000000
APPLICATION_DATETIME            0.000000
ACC_CREATION_DATETIME           0.000000
customer_id                     0.000000
dtype: float64

 ===== Removed 2119 duplicate rows.====== 
 
In [16]:
# --- Replace None with the median value of the column
bankaya_merged['DELAYED_PAYMENTS'].fillna(bankaya_merged['DELAYED_PAYMENTS'].median(), inplace=True)
# --- convert the column to a int. 
bankaya_merged['DELAYED_PAYMENTS'] = bankaya_merged['DELAYED_PAYMENTS'].astype('int')

Strategy for Encoding Categorical Variables

Encoding categorical variables is crucial for preparing the data for the model. Below is the strategy for encoding each categorical variable based on the number of unique values:

Data Cleaning

  • Handle Missing Values: Variables like WORST_DELAY_DATE and LOAN_CLOSING_DATE have a significant number of missing values that need to be handled. The threshold for dropping is 40-50% missing values. By knowing that I decided to drop columns above that limit.
  • Imputation: For numerical variables, mean or median imputation will be used. For categorical variables, mode imputation is a typical method. K-NN imputation can also be used for more Advanced method, depending on the dataset.

Encoding Variables

  • High Cardinality (>5 unique categories): For categorical variables with many categories, I will use target encoding in order to reduce dimensionality such as prevention_key.
  • Low Cardinality(<5 categorical values): One-hot encoding for low cardinality categorical variables.

Note: Some variables like customer_id and loan_id may not be useful for modeling and can be dropped.

I found and removed 2119 duplicate rows from the dataset. Duplicate entries can lead to skewed analysis and incorrect conclusions. Therefore, it's crucial to maintain a clean and unique dataset for accurate business decision-making.


Exploratory Data Analysis¶

Univariate Analysis¶

In [17]:
# --- Duplicate the Ddatframe to do the EDA
bankaya_eda = bankaya_merged.copy()
In [18]:
#bankaya_merged.head(5000).to_csv('/Users/iouriolguin/Documents/test.csv')
#bankaya_merged.to_csv('/Users/iouriolguin/Documents/bankaya_merged.csv')
#bankaya_merged.to_parquet('/Users/iouriolguin/Documents/bankaya_merged.parquet')
In [19]:
# --- Identify all numerical variables in the dataset for the EDA analysis
numerical_vars = bankaya_eda.select_dtypes(include=['number']).columns.tolist()
numerical_vars = [var for var in numerical_vars if var not in ['customer_id', 'loan_id']] #---> remove id variables

# --- Identify all categorical variables in the dataset
categorical_vars =  bankaya_eda.select_dtypes(include=['object']).columns.tolist()
categorical_vars.remove("CDC_INQUIRY_ID")

# --- Identify the date columns 
date_vars = [col for col in bankaya_eda.columns if 'date' in col.lower() or 'datetime' in col.lower()]
In [20]:
# --- Univariate Analysis of the target variable

# Plotting the distribution of 'outcome_type'
fig = px.pie(bankaya_eda["target"], names='target', title='Distribution of Default')
fig.show()
  • The dataset is imbalance so that I wil consider to treat this kind of dataset with a Resampling technique such as Oversampling or Undersampling. In this case, I will use SMOTE (Synthetic Minority Oversampling Technique).
In [21]:
# --- Create a histogram for certain variables
lst_histogram = ["max_days_late","account_to_application_days","MAX_CREDIT","TOTAL_PAYMENTS",'DELAYED_PAYMENTS','n_bnpl_apps','n_inquiries_l3m']
for col in lst_histogram:
    plot_histogram(bankaya_eda,col)
  • max_days_late: This variable mostly has low values, indicating that most customers are not significantly late on their payments.
  • account_to_application_days: The distribution shows that most customers apply for a loan shortly after creating an account.
  • TOTAL_PAYMENTS: This variable mostly takes on low values, suggesting that most customers have made few total payments.
  • MAX_CREDIT: The majority of values for "MAX_CREDIT" are concentrated around a specific range, with a sharp peak evident in the histogram. This suggests that a large number of customers or instances have a credit limit (or maximum credit value) within this specific range.
In [22]:
# Group by the application date and count the number of applications
application_time_series = bankaya_eda.groupby('APPLICATION_DATETIME').size().reset_index(name='Number_of_Applications')

# --- Time Series Plot for the most relevant KPIs
plot_time_series(application_time_series,'Number_of_Applications','APPLICATION_DATETIME','D','mean')
plot_time_series(bankaya_eda,'account_to_application_days','ACC_CREATION_DATETIME','D','sum')
plot_time_series(bankaya_eda,'n_sf_apps','LOAN_ORIGINATION_DATETIME','M','mean')
plot_time_series(bankaya_eda,'n_bnpl_apps','first_bnpl_app_date','M','mean')
plot_time_series(bankaya_eda,'n_inquiries_l3m','INQUIRY_DATE','M','mean')
plot_time_series(bankaya_eda,'TOTAL_PAYMENTS','LOAN_ORIGINATION_DATETIME','D','mean')

In order to plot the Time Series I use a Relevance to Objective strategy. The date column should be closely related to what I am trying to analyze. By doing this the date columns should aligned witht the objective or question to answer.

APPLICATION_DATETIME

  • The sharp spike around November 2022 to January 2023 is particularly noteworthy. It suggests some external factor or event that caused a sudden increase in loan applications. This could be due to factors like promotional offers, economic events, changes in lending criteria, or a broader financial situation impacting the economy.
  • There doesn't seem to be a clear seasonality or repeating pattern in the data within the provided timeframe.
  • The plot indicates that the number of loan applications was relatively stable and low between May 2022 to around November 2022.

Univariate Analysis of Categorical Variables¶

In [23]:
# --- Read the json file created in the prject in order to map and group all the Business types
# to reduce the cardinality in this variable

# Reading JSON from a file
with open("input/businees_type_catalog.json", 'r') as f:
    business_catalog_dict = json.load(f)

# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in business_catalog_dict.items() for k in old_v}

# Update column in the dataframe with the business categories
bankaya_eda['BUSINESS_TYPE'] = bankaya_eda['BUSINESS_TYPE'].replace(reverse_mapping)
In [24]:
# --- Pie Charts for Categorical Variables
# ---Remove the date features from the category variables
filtered_categorical_vars = list(set(categorical_vars) - set(date_vars))
filtered_categorical_vars = list(set(filtered_categorical_vars) - set(["CREDIT_TYPE","BUSINESS_TYPE","CURRENT_PAYMENT","PREVENTION_KEY"])) 

# --- Plot relevant category variables
for var in filtered_categorical_vars:
    plot_pie(bankaya_eda,var)

bivariate-multivariate-analysis¶

In [25]:
from ML_functions.plots import cross_values_plot
In [26]:
cross_values_plot(bankaya_eda,"max_days_late","BUSINESS_TYPE","ACCOUNT_TYPE",temp_col_name="days_late")
  • Financial Services: Dominated by "PAGOS FIJOS" at 90.24%, indicating a significant risk in this sector with fixed payment schemes.
  • Automotive and Financial Services sectors need closer monitoring due to their high fixed payment latencies.
  • Reevaluate credit policies for the "PAGOS FIJOS" account type in sectors with significant latencies.
In [27]:
# --- Investigate the average max_days_late for different 'account_to_application_days' bins
bins = [0, 7, 30, 90, 180, 365, float('inf')]
labels = ['0-7 days', '8-30 days', '31-90 days', '91-180 days', '181-365 days', 'Over a year']
bankaya_eda['Application_Timeframe'] = pd.cut(bankaya_eda['account_to_application_days'], bins=bins, labels=labels)
# ---  Calculate the average max_days_late for each bin
avg_max_days_late = bankaya_eda.groupby('Application_Timeframe')['max_days_late'].mean().reset_index()

# -- Plot the results
fig = px.bar(avg_max_days_late, x='Application_Timeframe', y='max_days_late', title='Average max_days_late by Application Timeframe')
fig.show()

Investigating the average max_days_late for people who apply for loans within 'X' days of creating an account until the apply for a loan.

  • A shorter time might indicate impulsiveness, which could be riskier.
  • As the timeframe extends from "91-180 days" to "Over a year", there is a slight decline in the average max_days_late. This could mean that customers who have spent more time with the platform before deciding to take a loan are more reliable in terms of payments.
  • As per the initial hypothesis, those who apply for loans very shortly after account creation (0-7 days) have, on average, the fewest days late on payments. This could suggest that the immediate decision to apply doesn't necessarily correlate with higher risk, contrary to the idea that impulsiveness might indicate risk.
In [28]:
# Calculate the average max_days_late for each RESPONSABILITY_TYPE
avg_max_days_late_by_responsability = bankaya_eda.groupby('RESPONSABILITY_TYPE')['max_days_late'].mean().reset_index()

# Plot the results
fig = px.bar(avg_max_days_late_by_responsability, x='RESPONSABILITY_TYPE', y='max_days_late', title='Average max_days_late by RESPONSABILITY_TYPE')
fig.show()

We could check if different responsibility types have different average max_days_late.

  • TITULAR CON AVAL: This category appears to represent primary borrowers with a guarantee or backing (AVAL). Interestingly, this category has the lowest average max_days_late among all categories, suggesting that having both a primary borrower and a guarantee may be a safer lending scenario.

  • MANCOMUNADO: This likely represents joint responsibility or co-signing of an account or debt. It's noticeable that this category has a similarly high average max_days_late as the "AVAL" category, suggesting co-borrowers might have equivalent risk levels as those who guarantee another's debt.

  • It might be beneficial to reassess the risk associated with the "AVAL" and "MANCOMUNADO" categories, given their high average max_days_late. This can help in refining lending strategies or terms for such categories.

In [29]:
from ML_functions.plots import plot_box, plot_cross_data
In [30]:
# Plot the distribution of CURRENT_BALANCE for different ACCOUNT_TYPEs
plot_box(bankaya_eda,'ACCOUNT_TYPE','max_days_late','Distribution of CURRENT_BALANCE by ACCOUNT_TYPE')

This could help in designing account features that help people maintain low balances.

  • Most account types show a median max_days_late close to or at zero, indicating that for many accounts, payments are made on time.
  • All account types have outliers with high max_days_late, which could indicate specific problematic accounts or customers with chronic late payments.
  • SIN LIMITE PRESTABLECIDO Displays a tightly packed interquartile range (IQR) close to zero, suggesting that most of these accounts maintain timely payments.However, there's a notable spread of outliers up to 60 days late.
  • This account type has a wider IQR, which implies more variability in payment punctuality. This could indicate a segment that either forgets or struggles to make fixed payments on time.

    Given the outliers across all account types, it would be beneficial to engage these high-risk groups with financial counseling, automated reminders, or even potential restructuring of payment terms.

In [31]:
# --- Read the json file created in the prject in order to map and group all the Credit types
# to reduce the cardinality in this variable

# Reading JSON from a file
with open("input/credit_type.json", 'r') as f:
    credit_catalog_dict = json.load(f)

# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in credit_catalog_dict.items() for k in old_v}

# Update column in the dataframe with the business categories
bankaya_eda['sub_credit_type'] = bankaya_eda['CREDIT_TYPE'].replace(reverse_mapping)
In [32]:
plot_cross_data(bankaya_eda,"max_days_late","sub_credit_type","ACCOUNT_TYPE","days later","mean")

By understanding which combinations of account and sub_credit types have higher days late on average, businesses can prioritize collections efforts on high-risk segments.

  • PAGOS FIJOS: This type is the most prevalent in almost every sub_credit_type, particularly dominant in "Credit Cards" and "PRESTAMO GARANTIZADO".
  • REVOLENTE: Notably prevalent in "Asset Financing", "Automotive Loans", and "Credit Cards".
  • The days late for this category seem to be spread out among several account types, with "PAGOS FIJOS" being the most significant contributor. Given the high interest and penalties often associated with credit cards, this could be an area of concern.
In [33]:
# --- Plot matrix correlation
matrix_correlation(bankaya_eda[numerical_vars],cols_to_exclude=None)
  • max_days_late: This suggests that as the maximum number of days a payment is late increases, the target value also tends to increase. This could indicate that this variable is a significant predictor for the target.
  • n_inquiries_3m and n_inquiries_6m: These two variables show a strong positive correlation of 0.69, which implies that they might be conveying similar information. If we want to avoid multicollinearity by using just one of them in the model where this might be an issue.
  • PAYMENT_AMOUNT and BALANCE_DUE have a high positive correlation of 0.79, suggesting that larger payment amounts are associated with higher due.
In [34]:
from ML_functions.util_functions import remove_highly_correlated_features
In [35]:
# --- Select the high correlated varoiables under a 0.7 threshold
high_correlated_lst = remove_highly_correlated_features(bankaya_eda[numerical_vars],threshold=0.7,cols_to_exclude=["target"])
high_correlated_lst
 === High correlated features to be removed from the Dataset====
Out[35]:
['WORST_DELAY', 'BALANCE_DUE', 'BALANCE_DUE_WORST_DELAY']

Data Preparation and Feature Engineering¶

In [36]:
# --- Select the the features to be used by removing high correlated variables
model_current_features = list(set(bankaya_merged.columns.to_list())- set(high_correlated_lst))
model_current_features = [var for var in model_current_features if 'id' not in var.lower() or "customer_id" in var]
In [37]:
# --- Create a new dataframe in order that not duplicate under the dataset
bankaya_model_df = bankaya_merged.copy()
bankaya_model_df = bankaya_model_df[model_current_features]
In [38]:
# --- Identify all numerical variables in the dataset for the modeling pprocesing. 
model_numerical_vars = bankaya_model_df.select_dtypes(include=['number']).columns.tolist() 

# --- Identify all categorical variables in the dataset
model_categorical_vars =  bankaya_model_df.select_dtypes(include=['object']).columns.tolist()

# --- Identify the date columns 
model_date_vars = [col for col in bankaya_model_df.columns if 'date' in col.lower() or 'datetime' in col.lower()]
In [39]:
# === Diagnostic outliers ===
for var in bankaya_model_df[model_numerical_vars]:
    if 'target' not in var and "customer_id" not in var:
        plot_diagnostic_outlier(bankaya_model_df[model_numerical_vars],var)
  • DELAYED_PAYMENTS The plot on the left appears to be a density plot (or Kernel Density Estimation) of the DELAYED_PAYMENTS variable. It shows that a vast majority of customers have very few delayed payments (close to 0). This suggests that a significant proportion of Bankaya's clientele is punctual with their payments.
  • account to application A significant number of customers seem to apply for credit very shortly after creating their accounts, as indicated by the peak near 0 days. As the number of days increases, the density decreases, indicating fewer customers apply for credit long after opening their accounts.
  • n_inquiries_l6m There are clear outliers depicted on the box plot, with customers making an unusually high number of inquiries compared to the majority. These outliers range roughly from 50 inquiries up to almost 200.
  • n_sf_apps The box plot distinctly portrays several outliers, with individuals submitting an unusually high number of special finance applications compared to the bulk of the dataset. These outliers range from roughly 3 applications upwards, with a few cases extending beyond 30.
  • max_days_late The box plot reveals numerous outliers. These individuals have a record of being significantly late on their payments, extending well beyond the typical range observed in the dataset. Some individuals have delays exceeding 40 days.

Customers that wait a significantly longer period before applying for credit might have different financial behaviors or needs. It would be beneficial for the bank to study these outliers separately. Understanding the timing of credit applications relative to account creation can provide valuable insights for optimizing the bank's credit approval process, marketing strategies, and customer service offerings.

It's crucial to understand the nature of these outliers before making any business decisions. They could either be anomalies that need to be corrected or valuable insights that could drive specific business strategies. After understanding the business we can apply the Z-score method to handle outliers

feature engineering¶

  • payment_to_due_ratio: This feature would represent how much of the due amount has been paid. It could be useful for understanding the payment behavior of an individual or entity. CURRENT_BALANCE and PAYMENT_AMOUNT.
  • Time to Loan Origination: It represents the duration in days between the application date (APPLICATION_DATETIME) and the loan origination date (LOAN_ORIGINATION_DATETIME).
  • Days between Last Payment and Last Purchase: Difference between LAST_PAYMENT_DATE and LAST_PURCHASE_DATE. A longer time might indicate financial distress or irregular payment behavior.
  • Delayed to Total Payments Ratio: Ratio of DELAYED_PAYMENTS to TOTAL_PAYMENTS. A higher ratio indicates a pattern of delayed payments, signaling a risky customer.
  • Number of Unique Credit Types by Customer: Count of unique CREDIT_TYPE for each customer_id. Multiple types of credit could either indicate financial savviness or over-leveraging.
  • Outstanding Balance: CREDIT_LIMIT - CURRENT_BALANCE. A higher outstanding balance relative to the credit limit could indicate financial distress.
  • Loan Life: Difference between LOAN_CLOSING_DATE and LOAN_OPENING_DATE. The customer paid off the loan quickly, possibly indicating financial stability and low risk or the loan had to be closed quickly due to default or restructuring, indicating high risk.

Customer Lifetime: Understanding customer tenure can help in segmenting the customer base for targeted marketing. Customer tenure segmentation can be an effective way to target different customer groups with tailored strategies. By segmenting my user I will be able to segment in a proper way my customers by doing a RFM (Recency, Frequency, Monetary) analysis used to calculate CLV and Customers with low CLV and high max_days_late could be considered high-risk.

For the prevention key feature I will use a weighted feature technique in order to give a weight to each category depending of the risk and then group the dataset by customer id. By doing this I will ensure the quality of the data.

For the Busniess type and Credit type I cleaned the categories and grouped them in order to avoid the dimensionality of the datset. I will create OHE for the general categories created for both variables. Regarding the categories from the dataset I will usem them to count how many unique busniess type and credit type the user has. The will be number busniess type and number credit type.

In [40]:
# --- Calculate the payment_to_due_ratio. This means how much of the current balance is covered by the payment amount. 
bankaya_model_df["payment_to_due_ratio"] = np.where(
    (bankaya_model_df["CURRENT_BALANCE"] == 0) & (bankaya_model_df["PAYMENT_AMOUNT"] == 0),
    0, round(bankaya_model_df["CURRENT_BALANCE"] / bankaya_model_df["PAYMENT_AMOUNT"],2))

# # --- Time to loan origination. It represents the duration in days between the application date and the 
# loan origination . 
bankaya_model_df["time_to_loan_days"] = -(bankaya_model_df["APPLICATION_DATETIME"] - bankaya_model_df["LOAN_ORIGINATION_DATETIME"]).dt.days

# --- Last payment days. A longer time might indicate financial distress or irregular payment behavior. 
bankaya_model_df["last_payment_days"]= abs(bankaya_model_df["LAST_PAYMENT_DATE"] - bankaya_model_df["LAST_PURCHASE_DATE"]).dt.days

# # Number of unique credit types by customer. Multiple types of credit could either indicate financial savviness or over-leveraging.
# #bankaya_model_df["n_credit_types"] = bankaya_model_df[]

# --- Outstanding balance. A higher outstanding balance relative to the credit limit could indicate financial distress.
bankaya_model_df["outstanding_balance"] = abs(bankaya_model_df["CREDIT_LIMIT"] - bankaya_model_df["CURRENT_BALANCE"])

# --- Loan life. The customer paid off the loan quickly, possibly indicating financial stability and low risk or 
# the loan had to be closed quickly due to default or restructuring, indicating high risk.
bankaya_model_df["loan_life_days"] = (bankaya_model_df["LOAN_CLOSING_DATE"] - bankaya_model_df["LOAN_OPENING_DATE"]).dt.days
In [41]:
# --- Read the json file created in the project in order to map and group all the Business types
# to reduce the cardinality in this variable

# Reading JSON from a file
with open("input/businees_type_catalog.json", 'r') as f:
    business_catalog_dict = json.load(f)

# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in business_catalog_dict.items() for k in old_v}

# Update column in the dataframe with the business categories
bankaya_model_df["gen_business_type"] = bankaya_model_df['BUSINESS_TYPE'].replace(reverse_mapping)


# --- Read the json file created in the prject in order to map and group all the Credit types
# to reduce the cardinality in this variable

# Reading JSON from a file
with open("input/credit_type.json", 'r') as f:
    credit_catalog_dict = json.load(f)

# Reverse the mapping to make it usable in `replace`
reverse_mapping = {k: old_k for old_k, old_v in credit_catalog_dict.items() for k in old_v}

# Update column in the dataframe with the business categories
bankaya_model_df['gen_credit_type'] = bankaya_model_df['CREDIT_TYPE'].replace(reverse_mapping)

# --- Reading JSON file from the prevention key scores

with open("input/prevention_key_scores.json","r") as f:
    risk_scores= json.load(f)
In [42]:
# --- Current features after the feature engenieering processing
cols_current_features =bankaya_model_df.columns.to_list()

# --- Remove unnecesary variables or varibales which weres used before. 
model_cols_to_remove = ["CURRENT_PAYMENT","TOTAL_REPORTED_PAYMENTS","CURRENT_PAYMENT",
                      "CREDIT_TYPE","BUSINESS_TYPE",'INQUIRY_TIME'] #+ model_date_vars

# --- Create the final list of variables to use for the model
model_remaining_features = list(set(cols_current_features) - set(model_cols_to_remove))
In [43]:
# --- Create the final datset which it will be grouped later to use for the model
bankaya_final_model_df = bankaya_model_df[model_remaining_features]

# ---- Lowercase all the columns in the datset
bankaya_final_model_df.columns = [col.lower() for col in bankaya_final_model_df.columns]
print(f"dataframe shape: {bankaya_final_model_df.shape}")
bankaya_final_model_df.head(5)
dataframe shape: (285237, 42)
Out[43]:
delayed_payments max_credit time_to_loan_days gen_credit_type credit_limit update_date n_inquiries_l3m prevention_key current_balance n_bnpl_apps outstanding_balance inquiry_date last_app_date n_sf_apps loan_origination_datetime last_payment_days payment_to_due_ratio worst_delay_date account_type first_app_date payment_frequency n_bnpl_approved_apps payment_amount account_to_application_days first_bnpl_app_date n_inquiries_l6m acc_creation_datetime loan_opening_date total_payments report_date responsability_type max_days_late target last_bnpl_app_date loan_closing_date currency last_purchase_date loan_life_days customer_id gen_business_type application_datetime last_payment_date
0 0 3000.0 178 Miscellaneous 0.0 2017-04-19 91.0 SIN OBSERVACION 0.0 1.0 0.0 2022-04-25 2022-04-25 1.0 2022-10-20 18:26:35 1427.0 0.0 NaT SIN LIMITE PREESTABLECIDO 2022-04-25 MENSUAL 0.0 0.0 245 2022-04-26 02:15:17.742 91.0 2021-08-23 13:57:56 2013-03-27 NaN 2017-02-28 INDIVIDUAL (TITULAR) 5 0 2022-04-26 02:15:17.742 2017-02-21 PESOS MEXICANOS 2017-02-21 1427.0 1223 General Services 2022-04-26 07:00:00 2013-03-27
1 0 500.0 178 Personal Loans 2200.0 2021-06-21 91.0 SIN OBSERVACION 0.0 1.0 2200.0 2022-04-25 2022-04-25 1.0 2022-10-20 18:26:35 15.0 0.0 NaT PAGOS FIJOS 2022-04-25 MENSUAL 0.0 0.0 245 2022-04-26 02:15:17.742 91.0 2021-08-23 13:57:56 2021-04-03 1.0 2021-06-21 INDIVIDUAL (TITULAR) 5 0 2022-04-26 02:15:17.742 2021-04-18 PESOS MEXICANOS 2021-04-03 15.0 1223 Financial Services 2022-04-26 07:00:00 2021-04-18
2 0 1100.0 178 Personal Loans 2200.0 2021-06-21 91.0 SIN OBSERVACION 0.0 1.0 2200.0 2022-04-25 2022-04-25 1.0 2022-10-20 18:26:35 4.0 0.0 NaT PAGOS FIJOS 2022-04-25 MENSUAL 0.0 0.0 245 2022-04-26 02:15:17.742 91.0 2021-08-23 13:57:56 2021-04-18 1.0 2021-06-21 INDIVIDUAL (TITULAR) 5 0 2022-04-26 02:15:17.742 2021-04-22 PESOS MEXICANOS 2021-04-18 4.0 1223 Financial Services 2022-04-26 07:00:00 2021-04-22
3 0 1700.0 178 Personal Loans 2500.0 2021-07-13 91.0 SIN OBSERVACION 0.0 1.0 2500.0 2022-04-25 2022-04-25 1.0 2022-10-20 18:26:35 11.0 0.0 NaT PAGOS FIJOS 2022-04-25 MENSUAL 0.0 0.0 245 2022-04-26 02:15:17.742 91.0 2021-08-23 13:57:56 2021-04-22 1.0 2021-07-13 INDIVIDUAL (TITULAR) 5 0 2022-04-26 02:15:17.742 2021-05-03 PESOS MEXICANOS 2021-04-22 11.0 1223 Financial Services 2022-04-26 07:00:00 2021-05-03
4 0 1800.0 178 Personal Loans 2500.0 2021-07-13 91.0 SIN OBSERVACION 0.0 1.0 2500.0 2022-04-25 2022-04-25 1.0 2022-10-20 18:26:35 23.0 0.0 NaT PAGOS FIJOS 2022-04-25 MENSUAL 0.0 0.0 245 2022-04-26 02:15:17.742 91.0 2021-08-23 13:57:56 2021-05-03 1.0 2021-07-13 INDIVIDUAL (TITULAR) 5 0 2022-04-26 02:15:17.742 2021-05-26 PESOS MEXICANOS 2021-05-03 23.0 1223 Financial Services 2022-04-26 07:00:00 2021-05-26
In [44]:
# --- Identify all categorical variables after including the new categories for the Business Type 
model_categorical_vars =  bankaya_final_model_df.select_dtypes(include=['object']).columns.tolist()

# --- Identify all numerical variables in the dataset for the modeling pprocesing. 
model_numerical_vars = bankaya_final_model_df.select_dtypes(include=['number']).columns.tolist() 

# --- Select the OHE variables to be transformed
model_ohe_vars = ["account_type","currency","responsability_type"]

# --- Select the TE variables to be transformed
model_te_vars = ["gen_credit_type","gen_business_type","payment_frequency"]

print(f"Total variables so far: {len(model_categorical_vars)+len(model_numerical_vars)}")
Total variables so far: 27
In [45]:
# --- Create dummie variables or OHE for the categorical variables
bankaya_model_df_ohe = pd.get_dummies(bankaya_final_model_df, columns=model_ohe_vars)

# Show the new columns created as a result of One-Hot Encoding to verify
new_columns = [col for col in bankaya_model_df_ohe.columns if any(col.startswith(prefix) for prefix in model_ohe_vars)]
#new_columns[:10]  # Displaying only the first 10 for brevity


# Initialize the target encoder
target_encoder = ce.TargetEncoder(cols=model_te_vars)

# Fit and transform to get the target encoded columns
bankaya_final_model_df_encoded = target_encoder.fit_transform(
    bankaya_model_df_ohe[model_te_vars], 
    bankaya_model_df_ohe['target']
)

# Combine the encoded columns with the rest of your dataframe (excluding original categorical columns)
bankaya_final_model_df = pd.concat(
    [
        bankaya_model_df_ohe.drop(model_te_vars, axis=1), 
        bankaya_final_model_df_encoded
    ], 
    axis=1
)
In [46]:
# --- Create a dictionary for the aggregation to be done. 
aggregations = {}

# --- Loop through each column to decide the aggregation method
for col in bankaya_final_model_df.columns:
    if col == 'customer_id':
        continue  # Skip the customer_id column
    if bankaya_final_model_df[col].dtype == 'object':
        aggregations[col] = ['nunique']  # Number of unique categories for categorical features
    elif 'datetime' in str(bankaya_final_model_df[col].dtype):
        aggregations[col] = ['min', 'max']  # Min and Max for datetime features
    else:
        aggregations[col] = ['mean', 'sum', 'count']  # Mean, Sum, and Count for numerical features


# --- Mapping the prevention key with the scores given. 
bankaya_final_model_df['prevention_key_risk_score'] = bankaya_final_model_df['prevention_key'].map(risk_scores)

# --- Update the aggregations dictionary to include the new risk score column
aggregations['prevention_key_risk_score'] = ['mean', 'sum', 'count']


# Group by 'customer_id' and aggregate using the defined aggregations
grouped_df = bankaya_final_model_df.groupby('customer_id').agg(aggregations).reset_index()

# Flatten the MultiIndex in columns
grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

# Rename the 'customer_id_' column to 'customer_id'
grouped_df.rename(columns={'customer_id_': 'customer_id'}, inplace=True)

# Display the first few rows of the grouped dataframe
grouped_df


#acc_creation_datetime_min	 - # --- Create a dictionary for the aggregation to be done. 
aggregations = {}

# --- Loop through each column to decide the aggregation method
for col in bankaya_final_model_df.columns:
    if col == 'customer_id':
        continue  # Skip the customer_id column
    if bankaya_final_model_df[col].dtype == 'object':
        aggregations[col] = ['nunique']  # Number of unique categories for categorical features
    elif 'datetime' in str(bankaya_final_model_df[col].dtype):
        aggregations[col] = ['min', 'max']  # Min and Max for datetime features
    else:
        aggregations[col] = ['mean', 'sum', 'count']  # Mean, Sum, and Count for numerical features


# --- Mapping the prevention key with the scores given. 
bankaya_final_model_df['prevention_key_risk_score'] = bankaya_final_model_df['prevention_key'].map(risk_scores)

# --- Update the aggregations dictionary to include the new risk score column
aggregations['prevention_key_risk_score'] = ['mean', 'sum', 'count']


# Group by 'customer_id' and aggregate using the defined aggregations
grouped_df = bankaya_final_model_df.groupby('customer_id').agg(aggregations).reset_index()

# Flatten the MultiIndex in columns
grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

# Rename the 'customer_id_' column to 'customer_id'
grouped_df.rename(columns={'customer_id_': 'customer_id'}, inplace=True)

# Display the first few rows of the grouped dataframe
grouped_df


#acc_creation_datetime_min	 - # --- Create a dictionary for the aggregation to be done. 
aggregations = {}

# --- Loop through each column to decide the aggregation method
for col in bankaya_final_model_df.columns:
    if col == 'customer_id':
        continue  # Skip the customer_id column
    if bankaya_final_model_df[col].dtype == 'object':
        aggregations[col] = ['nunique']  # Number of unique categories for categorical features
    elif 'datetime' in str(bankaya_final_model_df[col].dtype):
        aggregations[col] = ['min', 'max']  # Min and Max for datetime features
    else:
        aggregations[col] = ['mean', 'sum', 'count']  # Mean, Sum, and Count for numerical features


# --- Mapping the prevention key with the scores given. 
bankaya_final_model_df['prevention_key_risk_score'] = bankaya_final_model_df['prevention_key'].map(risk_scores)

# --- Update the aggregations dictionary to include the new risk score column
aggregations['prevention_key_risk_score'] = ['mean', 'sum', 'count']


# Group by 'customer_id' and aggregate using the defined aggregations
grouped_df = bankaya_final_model_df.groupby('customer_id').agg(aggregations).reset_index()

# Flatten the MultiIndex in columns
grouped_df.columns = ['_'.join(col).strip() for col in grouped_df.columns.values]

# Rename the 'customer_id_' column to 'customer_id'
grouped_df.rename(columns={'customer_id_': 'customer_id'}, inplace=True)

# --- Create customer lifetime feature

grouped_df['customer_lifetime'] = abs(grouped_df["acc_creation_datetime_min"] - grouped_df["application_datetime_max"]).dt.days

# Display the first few rows of the grouped dataframe
grouped_df


# customer_lifespin = acc_creation_datetime_min	 - application_datetime_max
Out[46]:
customer_id delayed_payments_mean delayed_payments_sum delayed_payments_count max_credit_mean max_credit_sum max_credit_count time_to_loan_days_mean time_to_loan_days_sum time_to_loan_days_count credit_limit_mean credit_limit_sum credit_limit_count update_date_min update_date_max n_inquiries_l3m_mean n_inquiries_l3m_sum n_inquiries_l3m_count prevention_key_nunique current_balance_mean current_balance_sum current_balance_count n_bnpl_apps_mean n_bnpl_apps_sum n_bnpl_apps_count outstanding_balance_mean outstanding_balance_sum outstanding_balance_count inquiry_date_min inquiry_date_max last_app_date_min last_app_date_max n_sf_apps_mean n_sf_apps_sum n_sf_apps_count loan_origination_datetime_min loan_origination_datetime_max last_payment_days_mean last_payment_days_sum last_payment_days_count payment_to_due_ratio_mean payment_to_due_ratio_sum payment_to_due_ratio_count worst_delay_date_min worst_delay_date_max first_app_date_min first_app_date_max n_bnpl_approved_apps_mean n_bnpl_approved_apps_sum n_bnpl_approved_apps_count payment_amount_mean payment_amount_sum payment_amount_count account_to_application_days_mean account_to_application_days_sum account_to_application_days_count first_bnpl_app_date_min first_bnpl_app_date_max n_inquiries_l6m_mean n_inquiries_l6m_sum n_inquiries_l6m_count acc_creation_datetime_min acc_creation_datetime_max loan_opening_date_min loan_opening_date_max total_payments_mean total_payments_sum total_payments_count report_date_min report_date_max max_days_late_mean max_days_late_sum max_days_late_count target_mean target_sum target_count last_bnpl_app_date_min last_bnpl_app_date_max loan_closing_date_min loan_closing_date_max last_purchase_date_min last_purchase_date_max loan_life_days_mean loan_life_days_sum loan_life_days_count application_datetime_min application_datetime_max last_payment_date_min last_payment_date_max account_type_CREDITO DE HABILITACION O AVIO_mean account_type_CREDITO DE HABILITACION O AVIO_sum account_type_CREDITO DE HABILITACION O AVIO_count account_type_HIPOTECA_mean account_type_HIPOTECA_sum account_type_HIPOTECA_count account_type_PAGOS FIJOS_mean account_type_PAGOS FIJOS_sum account_type_PAGOS FIJOS_count account_type_POR DETERMINAR_mean account_type_POR DETERMINAR_sum account_type_POR DETERMINAR_count account_type_REVOLVENTE_mean account_type_REVOLVENTE_sum account_type_REVOLVENTE_count account_type_SIN LIMITE PREESTABLECIDO_mean account_type_SIN LIMITE PREESTABLECIDO_sum account_type_SIN LIMITE PREESTABLECIDO_count currency_DOLARES_mean currency_DOLARES_sum currency_DOLARES_count currency_PESOS MEXICANOS_mean currency_PESOS MEXICANOS_sum currency_PESOS MEXICANOS_count currency_UNIDADES DE INVERSION_mean currency_UNIDADES DE INVERSION_sum currency_UNIDADES DE INVERSION_count responsability_type_AVAL_mean responsability_type_AVAL_sum responsability_type_AVAL_count responsability_type_INDIVIDUAL (TITULAR)_mean responsability_type_INDIVIDUAL (TITULAR)_sum responsability_type_INDIVIDUAL (TITULAR)_count responsability_type_MANCOMUNADO_mean responsability_type_MANCOMUNADO_sum responsability_type_MANCOMUNADO_count responsability_type_OBLIGATORIO SOLIDARIO_mean responsability_type_OBLIGATORIO SOLIDARIO_sum responsability_type_OBLIGATORIO SOLIDARIO_count responsability_type_TITULAR CON AVAL_mean responsability_type_TITULAR CON AVAL_sum responsability_type_TITULAR CON AVAL_count gen_credit_type_mean gen_credit_type_sum gen_credit_type_count gen_business_type_mean gen_business_type_sum gen_business_type_count payment_frequency_mean payment_frequency_sum payment_frequency_count prevention_key_risk_score_mean prevention_key_risk_score_sum prevention_key_risk_score_count customer_lifetime
0 1 0.125000 1 8 21257.000000 170056.0 8 4.0 32 8 69080.000000 552640.0 8 2018-01-24 2022-06-11 NaN 0.0 0 2 2246.000000 17968.0 8 2.0 16.0 8 66834.000000 534672.0 8 2022-06-16 2022-06-16 2022-06-16 2022-06-16 1.0 8.0 8 2023-02-16 11:44:50 2023-02-16 11:44:50 76.250000 610.0 8 4.272500 34.18 8 NaT NaT 2022-06-16 2022-06-16 2.0 16.0 8 202.625000 1621.0 8 851.0 6808 8 2022-10-27 19:15:56.544 2022-10-27 19:15:56.544 NaN 0.0 0 2020-10-14 18:22:10 2020-10-14 18:22:10 2003-08-21 2017-04-21 1.000000 1.0 1 2017-11-30 2022-05-31 0.0 0 8 0.0 0 8 2022-07-21 17:38:17.667 2022-07-21 17:38:17.667 2017-12-19 2018-04-05 2017-11-30 2022-05-31 1341.500000 2683.0 2 2023-02-13 08:00:00 2023-02-13 08:00:00 2017-11-28 2022-05-27 0.0 0 8 0.00000 0 8 0.000000 0 8 0.000000 0 8 0.750000 6 8 0.250000 2 8 0.0 0 8 1.0 8 8 0.0 0 8 0.0 0 8 1.000000 8 8 0.000000 0 8 0.000000 0 8 0.000000 0 8 0.178660 1.429281 8 0.173792 1.390340 8 0.162429 1.299428 8 1.250000 10.0 8 851
1 2 0.000000 0 10 15227.500000 152275.0 10 2.0 20 10 37566.666667 338100.0 9 2020-04-09 2022-04-25 0.0 0.0 10 4 2577.900000 25779.0 10 1.0 10.0 10 34702.333333 312321.0 9 2022-05-02 2022-05-02 2022-11-09 2022-11-09 8.0 80.0 10 2023-01-27 16:17:49 2023-01-27 16:17:49 242.500000 2425.0 10 NaN NaN 10 NaT NaT 2022-05-02 2022-05-02 1.0 10.0 10 168.900000 1689.0 10 828.0 8280 10 2022-07-01 16:50:01.030 2022-07-01 16:50:01.030 0.0 0.0 10 2020-10-19 12:11:43 2020-10-19 12:11:43 2009-08-15 2021-07-26 18.500000 37.0 2 2020-03-31 2022-04-19 -1.0 -10 10 0.0 0 10 2022-07-01 16:50:01.030 2022-07-01 16:50:01.030 2020-03-11 2022-01-21 2016-01-05 2022-04-14 1256.333333 7538.0 6 2023-01-26 08:00:00 2023-01-26 08:00:00 2019-11-26 2022-04-16 0.0 0 10 0.00000 0 10 0.100000 1 10 0.000000 0 10 0.900000 9 10 0.000000 0 10 0.0 0 10 1.0 10 10 0.0 0 10 0.0 0 10 1.000000 10 10 0.000000 0 10 0.000000 0 10 0.000000 0 10 0.183405 1.834049 10 0.165376 1.653760 10 0.162429 1.624285 10 2.000000 20.0 10 828
2 3 2.656250 85 32 14743.343750 471787.0 32 11.0 352 32 16751.250000 536040.0 32 2014-06-16 2022-10-12 0.0 0.0 32 4 14773.718750 472759.0 32 1.0 32.0 32 3392.906250 108573.0 32 2022-10-19 2022-10-19 NaT NaT NaN 0.0 0 2023-02-02 18:21:12 2023-02-02 18:21:12 91.269231 2373.0 26 4.142500 132.56 32 2021-12-31 2022-07-31 NaT NaT 0.0 0.0 32 379.468750 12143.0 32 818.0 26176 32 2022-10-19 03:51:19.918 2022-10-19 03:51:19.918 31.0 992.0 32 2020-10-26 11:03:57 2020-10-26 11:03:57 2014-05-03 2021-12-06 12.812500 410.0 32 2014-06-16 2022-09-30 63.0 2016 32 1.0 32 32 2022-10-19 03:51:19.918 2022-10-19 03:51:19.918 2014-06-16 2022-08-31 2014-05-03 2021-12-06 67.090909 1476.0 22 2023-01-23 08:00:00 2023-01-23 08:00:00 2014-06-16 2022-09-07 0.0 0 32 0.03125 1 32 0.906250 29 32 0.062500 2 32 0.000000 0 32 0.000000 0 32 0.0 0 32 1.0 32 32 0.0 0 32 0.0 0 32 1.000000 32 32 0.000000 0 32 0.000000 0 32 0.000000 0 32 0.165853 5.307290 32 0.167984 5.375483 32 0.163515 5.232486 32 1.500000 48.0 32 818
3 4 0.153846 2 13 3911.923077 50855.0 13 23.0 299 13 2858.076923 37155.0 13 2020-10-07 2022-01-11 0.0 0.0 13 2 3447.923077 44823.0 13 7.0 91.0 13 2155.846154 28026.0 13 2022-01-14 2022-01-14 2022-01-21 2022-01-21 6.0 78.0 13 2022-10-06 13:07:47 2022-10-06 13:07:47 49.181818 541.0 11 4.870769 63.32 13 2021-11-30 2021-11-30 2022-01-14 2022-01-14 0.0 0.0 13 421.307692 5477.0 13 687.0 8931 13 2022-03-14 01:34:34.104 2022-03-14 01:34:34.104 0.0 0.0 13 2020-10-26 12:47:09 2020-10-26 12:47:09 2020-06-30 2021-12-15 5.900000 59.0 10 2020-09-30 2022-01-03 28.0 364 13 0.0 0 13 2022-02-19 22:16:14.775 2022-02-19 22:16:14.775 2020-09-01 2021-12-15 2020-06-30 2021-12-26 26.333333 158.0 6 2022-09-14 07:00:00 2022-09-14 07:00:00 2020-09-01 2022-01-03 0.0 0 13 0.00000 0 13 0.769231 10 13 0.000000 0 13 0.230769 3 13 0.000000 0 13 0.0 0 13 1.0 13 13 0.0 0 13 0.0 0 13 1.000000 13 13 0.000000 0 13 0.000000 0 13 0.000000 0 13 0.169715 2.206294 13 0.169220 2.199859 13 0.164236 2.135074 13 1.076923 14.0 13 687
4 5 6.204082 304 49 27515.775510 1348273.0 49 28.0 1372 49 18409.478261 846836.0 46 2007-08-01 2021-10-01 0.0 0.0 49 9 12025.612245 589255.0 49 NaN 0.0 0 10717.413043 493001.0 46 2021-10-04 2021-10-04 2021-10-04 2021-10-04 1.0 49.0 49 2022-10-18 13:12:47 2022-10-18 13:12:47 267.133333 12021.0 45 2.582857 126.56 49 2007-07-29 2021-08-31 2021-10-04 2021-10-04 NaN 0.0 0 9184.857143 450058.0 49 694.0 34006 49 NaT NaT 0.0 0.0 49 2020-10-26 20:13:57 2020-10-26 20:13:57 2002-06-18 2021-05-14 48.621622 1799.0 37 2007-08-01 2021-09-30 6.0 294 49 0.0 0 49 NaT NaT 2007-08-01 2021-08-17 2007-03-25 2021-09-04 528.000000 14256.0 27 2022-09-21 07:00:00 2022-09-21 07:00:00 2007-08-01 2021-09-24 0.0 0 49 0.00000 0 49 0.408163 20 49 0.000000 0 49 0.367347 18 49 0.224490 11 49 0.0 0 49 1.0 49 49 0.0 0 49 0.0 0 49 1.000000 49 49 0.000000 0 49 0.000000 0 49 0.000000 0 49 0.172096 8.432726 49 0.165770 8.122741 49 0.166272 8.147327 49 3.142857 154.0 49 694
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9244 14391 2.538462 66 26 18744.423077 487355.0 26 7.0 182 26 4722.269231 122779.0 26 2019-10-31 2023-04-06 52.0 1352.0 26 3 541.576923 14081.0 26 1.0 26.0 26 5141.538462 133680.0 26 2023-04-21 2023-04-21 2023-04-20 2023-04-20 1.0 26.0 26 2023-04-27 15:08:48 2023-04-27 15:08:48 106.333333 2552.0 24 0.192308 5.00 26 2020-02-29 2023-02-28 2023-04-20 2023-04-20 0.0 0.0 26 541.576923 14081.0 26 0.0 0 26 2023-04-21 01:37:54.043 2023-04-21 01:37:54.043 52.0 1352.0 26 2023-04-20 20:01:02 2023-04-20 20:01:02 2019-05-31 2022-08-26 15.000000 345.0 23 2019-10-31 2023-03-31 21.0 546 26 0.0 0 26 2023-04-21 01:37:54.043 2023-04-21 01:37:54.043 2019-10-14 2023-03-30 2019-06-06 2023-03-01 212.571429 4464.0 21 2023-04-21 07:00:00 2023-04-21 07:00:00 2019-10-13 2023-03-01 0.0 0 26 0.00000 0 26 0.730769 19 26 0.038462 1 26 0.076923 2 26 0.153846 4 26 0.0 0 26 1.0 26 26 0.0 0 26 0.0 0 26 0.807692 21 26 0.038462 1 26 0.153846 4 26 0.000000 0 26 0.170023 4.420607 26 0.170693 4.438015 26 0.168942 4.392479 26 2.461538 64.0 26 0
9245 14404 9.115385 237 26 4369.269231 113601.0 26 3.0 78 26 3129.653846 81371.0 26 2005-12-18 2023-04-19 28.0 728.0 26 5 2609.807692 67855.0 26 1.0 26.0 26 2935.692308 76328.0 26 2023-04-23 2023-04-23 NaT NaT NaN 0.0 0 2023-04-27 17:24:48 2023-04-27 17:24:48 71.208333 1709.0 24 1.433077 37.26 26 2005-11-01 2023-03-31 NaT NaT 0.0 0.0 26 2249.076923 58476.0 26 2.0 52 26 2023-04-23 06:05:02.883 2023-04-23 06:05:02.883 28.0 728.0 26 2023-04-22 20:37:42 2023-04-22 20:37:42 2005-05-31 2023-02-28 16.416667 394.0 24 2005-12-18 2023-04-19 -1.0 -26 26 0.0 0 26 2023-04-23 06:05:02.883 2023-04-23 06:05:02.883 2005-12-18 2023-03-06 2005-05-31 2023-03-11 128.937500 2063.0 16 2023-04-25 07:00:00 2023-04-25 07:00:00 2005-12-18 2023-04-10 0.0 0 26 0.00000 0 26 0.769231 20 26 0.000000 0 26 0.115385 3 26 0.115385 3 26 0.0 0 26 1.0 26 26 0.0 0 26 0.0 0 26 1.000000 26 26 0.000000 0 26 0.000000 0 26 0.000000 0 26 0.167796 4.362688 26 0.167519 4.355506 26 0.167344 4.350938 26 2.923077 76.0 26 2
9246 14408 35.250000 141 4 29629.000000 118516.0 4 4.0 16 4 0.000000 0.0 4 2021-02-28 2023-04-09 10.0 40.0 4 2 7764.250000 31057.0 4 1.0 4.0 4 7764.250000 31057.0 4 2023-04-23 2023-04-23 2023-04-23 2023-04-23 1.0 4.0 4 2023-04-27 09:32:32 2023-04-27 09:32:32 83.000000 332.0 4 0.500000 2.00 4 2020-12-20 2021-03-13 2023-04-23 2023-04-23 0.0 0.0 4 7764.250000 31057.0 4 0.0 0 4 2023-04-23 18:06:58.837 2023-04-23 18:06:58.837 10.0 40.0 4 2023-04-23 12:53:02 2023-04-23 12:53:02 2019-03-10 2021-02-22 63.500000 254.0 4 2021-02-28 2023-04-09 23.0 92 4 0.0 0 4 2023-04-23 18:06:58.837 2023-04-23 18:06:58.837 2021-02-17 2021-06-30 2019-03-10 2021-02-22 113.500000 227.0 2 2023-04-24 07:00:00 2023-04-24 07:00:00 2019-04-27 2021-06-30 0.0 0 4 0.00000 0 4 1.000000 4 4 0.000000 0 4 0.000000 0 4 0.000000 0 4 0.0 0 4 1.0 4 4 0.0 0 4 0.0 0 4 1.000000 4 4 0.000000 0 4 0.000000 0 4 0.000000 0 4 0.168101 0.672403 4 0.162282 0.649126 4 0.170136 0.680544 4 5.500000 22.0 4 0
9247 14410 2.262821 353 156 2796.339744 436229.0 156 3.0 468 156 123.384615 19248.0 156 2011-07-27 2023-04-19 33.0 5148.0 156 5 183.089744 28562.0 156 1.0 156.0 156 209.846154 32736.0 156 2023-04-23 2023-04-23 2023-04-23 2023-04-23 1.0 156.0 156 2023-04-26 21:25:09 2023-04-26 21:25:09 89.793333 13469.0 150 0.160256 25.00 156 2011-07-20 2023-04-12 2023-04-23 2023-04-23 0.0 0.0 156 168.448718 26278.0 156 0.0 0 156 2023-04-23 20:54:13.506 2023-04-23 20:54:13.506 33.0 5148.0 156 2023-04-23 15:49:12 2023-04-23 15:49:12 2011-03-25 2023-02-19 4.993506 769.0 154 2011-07-27 2023-04-19 0.0 0 156 0.0 0 156 2023-04-23 20:54:13.506 2023-04-23 20:54:13.506 2011-07-22 2023-02-22 2011-03-25 2023-03-24 85.902098 12284.0 143 2023-04-24 07:00:00 2023-04-24 07:00:00 2011-07-21 2023-03-22 0.0 0 156 0.00000 0 156 0.237179 37 156 0.730769 114 156 0.025641 4 156 0.006410 1 156 0.0 0 156 1.0 156 156 0.0 0 156 0.0 0 156 0.974359 152 156 0.000000 0 156 0.012821 2 156 0.012821 2 156 0.166609 25.991020 156 0.158096 24.663003 156 0.169450 26.434217 156 1.256410 196.0 156 0
9248 14416 1.821429 51 28 20347.642857 569734.0 28 8.0 224 28 21060.392857 589691.0 28 2009-04-03 2023-04-02 11.0 308.0 28 3 5381.107143 150671.0 28 1.0 28.0 28 15679.285714 439020.0 28 2023-04-24 2023-04-24 2023-04-24 2023-04-24 1.0 28.0 28 2023-05-05 21:09:11 2023-05-05 21:09:11 223.400000 5585.0 25 0.107143 3.00 28 2008-07-30 2023-03-31 2023-04-24 2023-04-24 0.0 0.0 28 5381.107143 150671.0 28 3.0 84 28 2023-04-24 23:36:58.288 2023-04-24 23:36:58.288 11.0 308.0 28 2023-04-24 18:36:17 2023-04-24 18:36:17 2005-11-04 2023-03-17 11.925926 322.0 27 2009-04-03 2023-04-02 0.0 0 28 0.0 0 28 2023-04-24 23:36:58.288 2023-04-24 23:36:58.288 2006-10-31 2023-02-22 2005-11-04 2021-12-30 587.960000 14699.0 25 2023-04-28 07:00:00 2023-04-28 07:00:00 2006-10-31 2023-01-12 0.0 0 28 0.00000 0 28 0.678571 19 28 0.107143 3 28 0.142857 4 28 0.071429 2 28 0.0 0 28 1.0 28 28 0.0 0 28 0.0 0 28 1.000000 28 28 0.000000 0 28 0.000000 0 28 0.000000 0 28 0.157965 4.423030 28 0.157886 4.420809 28 0.165593 4.636598 28 1.357143 38.0 28 3

9249 rows × 144 columns

In [47]:
# --- Choose only the variables I use for the model

model_features_final = [
    'customer_id','target_mean','customer_lifetime','n_sf_apps_sum','prevention_key_nunique',
    'account_to_application_days_count', 'loan_life_days_mean', 'max_credit_sum',
    'total_payments_sum', 'n_inquiries_l3m_mean', 'payment_amount_sum',
    'payment_to_due_ratio_mean', 'time_to_loan_days_mean',
    'credit_limit_sum', 'n_bnpl_approved_apps_mean', 'outstanding_balance_mean',
    'n_inquiries_l6m_mean', 'n_bnpl_apps_sum', 'max_days_late_sum', 'last_payment_days_sum',
    'delayed_payments_sum', 'current_balance_sum', 'account_type_CREDITO DE HABILITACION O AVIO_sum',
    'account_type_HIPOTECA_sum', 'account_type_PAGOS FIJOS_sum', 'account_type_POR DETERMINAR_sum',
    'account_type_REVOLVENTE_sum', 'account_type_SIN LIMITE PREESTABLECIDO_sum', 'currency_DOLARES_sum',
    'currency_PESOS MEXICANOS_sum', 'currency_UNIDADES DE INVERSION_sum', 'responsability_type_AVAL_sum',
    'responsability_type_INDIVIDUAL (TITULAR)_sum', 'responsability_type_MANCOMUNADO_sum',
    'responsability_type_OBLIGATORIO SOLIDARIO_sum', 'responsability_type_TITULAR CON AVAL_sum',
    'prevention_key_risk_score_mean','gen_credit_type_mean','gen_business_type_mean','payment_frequency_mean'
]
In [48]:
# --- Create the final dataframe for the model
bankaya_grouped_df = grouped_df[model_features_final]

# ---- Lowercase all the columns in the datset
bankaya_grouped_df.columns = [col.lower() for col in bankaya_grouped_df.columns ]
In [49]:
# ---- Rename the columns for the datset and get a better understanding
new_cols_named = {
    'target_mean':'target','n_sf_apps_sum':'n_sf_apps',
    'account_to_application_days_count' : 'total_credits','max_credit_sum':'max_credit', 'total_payments_sum': 'total_payments_sum',
    'payment_amount_sum':'payment_amount','payment_to_due_ratio_mean':'payment_to_due_ratio','credit_limit_sum':'credit_limit','outstanding_balance_mean':'outstanding_balance',
    'n_bnpl_apps_sum':'n_bnpl_apps','max_days_late_sum':'max_days_late','last_payment_days_sum':'last_payment_days','delayed_payments_sum':'delayed_payments',
    'current_balance_sum':'current_balance','account_type_CREDITO DE HABILITACION O AVIO_sum':'account_type_CREDITO_DE_HABILITACION_O_AVIO',
    'account_type_HIPOTECA_sum':'account_type_HIPOTECA','account_type_PAGOS FIJOS_sum':'account_type_PAGOS_FIJOS',
    'account_type_POR DETERMINAR_sum':'account_type_POR_DETERMINAR','account_type_REVOLVENTE_sum':'account_type_REVOLVENTE',
    'account_type_SIN LIMITE PREESTABLECIDO_sum': 'account_type_SIN_LIMITE_PREESTABLECIDO','currency_DOLARES_sum':'currency_DOLARES',
    'currency_PESOS MEXICANOS_sum':'currency_PESOS_MEXICANOS','currency_UNIDADES DE INVERSION_sum':'currency_UNIDADES_DE_INVERSION',
    'responsability_type_AVAL_sum':'responsability_type_AVAL','responsability_type_INDIVIDUAL (TITULAR)_sum':'responsability_type_INDIVIDUAL(TITULAR)',
    'responsability_type_MANCOMUNADO_sum':'responsability_type_MANCOMUNADO', 'responsability_type_OBLIGATORIO SOLIDARIO_sum': 'responsability_type_OBLIGATORIO_SOLIDARIO',
    'responsability_type_TITULAR CON AVAL_sum':'responsability_type_TITULAR_CON_AVAL','prevention_key_risk_score_mean':'prevention_key_risk_score',
    'gen_credit_type_mean':'gen_credit_type','gen_business_type_mean':'gen_business_type','payment_frequency_mean':'payment_frequency'
}

# Use the rename method with the dictionary to rename the columns
bankaya_grouped_df = bankaya_grouped_df.rename(columns=new_cols_named)

# --- Display final dataset t be modeled 
bankaya_grouped_df
Out[49]:
customer_id target customer_lifetime n_sf_apps prevention_key_nunique total_credits loan_life_days_mean max_credit total_payments_sum n_inquiries_l3m_mean payment_amount payment_to_due_ratio time_to_loan_days_mean credit_limit n_bnpl_approved_apps_mean outstanding_balance n_inquiries_l6m_mean n_bnpl_apps max_days_late last_payment_days delayed_payments current_balance account_type_credito de habilitacion o avio_sum account_type_hipoteca_sum account_type_pagos fijos_sum account_type_por determinar_sum account_type_revolvente_sum account_type_sin limite preestablecido_sum currency_dolares_sum currency_pesos mexicanos_sum currency_unidades de inversion_sum responsability_type_aval_sum responsability_type_individual (titular)_sum responsability_type_mancomunado_sum responsability_type_obligatorio solidario_sum responsability_type_titular con aval_sum prevention_key_risk_score gen_credit_type gen_business_type payment_frequency
0 1 0.0 851 8.0 2 8 1341.500000 170056.0 1.0 NaN 1621.0 4.272500 4.0 552640.0 2.0 66834.000000 NaN 16.0 0 610.0 1 17968.0 0 0 0 0 6 2 0 8 0 0 8 0 0 0 1.250000 0.178660 0.173792 0.162429
1 2 0.0 828 80.0 4 10 1256.333333 152275.0 37.0 0.0 1689.0 NaN 2.0 338100.0 1.0 34702.333333 0.0 10.0 -10 2425.0 0 25779.0 0 0 1 0 9 0 0 10 0 0 10 0 0 0 2.000000 0.183405 0.165376 0.162429
2 3 1.0 818 0.0 4 32 67.090909 471787.0 410.0 0.0 12143.0 4.142500 11.0 536040.0 0.0 3392.906250 31.0 32.0 2016 2373.0 85 472759.0 0 1 29 2 0 0 0 32 0 0 32 0 0 0 1.500000 0.165853 0.167984 0.163515
3 4 0.0 687 78.0 2 13 26.333333 50855.0 59.0 0.0 5477.0 4.870769 23.0 37155.0 0.0 2155.846154 0.0 91.0 364 541.0 2 44823.0 0 0 10 0 3 0 0 13 0 0 13 0 0 0 1.076923 0.169715 0.169220 0.164236
4 5 0.0 694 49.0 9 49 528.000000 1348273.0 1799.0 0.0 450058.0 2.582857 28.0 846836.0 NaN 10717.413043 0.0 0.0 294 12021.0 304 589255.0 0 0 20 0 18 11 0 49 0 0 49 0 0 0 3.142857 0.172096 0.165770 0.166272
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9244 14391 0.0 0 26.0 3 26 212.571429 487355.0 345.0 52.0 14081.0 0.192308 7.0 122779.0 0.0 5141.538462 52.0 26.0 546 2552.0 66 14081.0 0 0 19 1 2 4 0 26 0 0 21 1 4 0 2.461538 0.170023 0.170693 0.168942
9245 14404 0.0 2 0.0 5 26 128.937500 113601.0 394.0 28.0 58476.0 1.433077 3.0 81371.0 0.0 2935.692308 28.0 26.0 -26 1709.0 237 67855.0 0 0 20 0 3 3 0 26 0 0 26 0 0 0 2.923077 0.167796 0.167519 0.167344
9246 14408 0.0 0 4.0 2 4 113.500000 118516.0 254.0 10.0 31057.0 0.500000 4.0 0.0 0.0 7764.250000 10.0 4.0 92 332.0 141 31057.0 0 0 4 0 0 0 0 4 0 0 4 0 0 0 5.500000 0.168101 0.162282 0.170136
9247 14410 0.0 0 156.0 5 156 85.902098 436229.0 769.0 33.0 26278.0 0.160256 3.0 19248.0 0.0 209.846154 33.0 156.0 0 13469.0 353 28562.0 0 0 37 114 4 1 0 156 0 0 152 0 2 2 1.256410 0.166609 0.158096 0.169450
9248 14416 0.0 3 28.0 3 28 587.960000 569734.0 322.0 11.0 150671.0 0.107143 8.0 589691.0 0.0 15679.285714 11.0 28.0 0 5585.0 51 150671.0 0 0 19 3 4 2 0 28 0 0 28 0 0 0 1.357143 0.157965 0.157886 0.165593

9249 rows × 40 columns


Kmeans Clustering¶

In [50]:
from sklearn.cluster import KMeans
import numpy as np
In [51]:
# Prepare the data for clustering
# Remove NaN values and reshape the data to meet the requirements of the KMeans algorithm
customer_tenure_data = bankaya_grouped_df['customer_lifetime'].dropna().values.reshape(-1, 1)

# Using the elbow method to find the optimal number of clusters
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(customer_tenure_data)
    inertia.append(kmeans.inertia_)

# Plotting the elbow graph
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method For Optimal Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
  • The elbow graph helps us find the optimal number of clusters by plotting the inertia (within-cluster sum of squares) for different numbers of clusters. In this case, the elbow point seems to occur around 3 or 4 clusters, where the inertia starts to decrease at a slower rate. Given this, I could choose either 3 or 4 clusters for segmenting the customer tenure.
In [52]:
# ---Perform K-means clustering with 3 clusters (as one of the potential elbow points)
kmeans = KMeans(n_clusters=3, random_state=0)
bankaya_grouped_df['Lifetime_Cluster'] = kmeans.fit_predict(customer_tenure_data)

#---Calculate the mean and count for each cluster
cluster_summary = bankaya_grouped_df.groupby('Lifetime_Cluster')['customer_lifetime'].agg(['mean', 'count']).reset_index()

# ---Sort clusters by the mean tenure
cluster_summary = cluster_summary.sort_values(by='mean')

# ---Rename clusters based on the sorted mean tenure
cluster_map = {row['Lifetime_Cluster']: 'Cluster_{}'.format(index) for index, row in cluster_summary.iterrows()}
bankaya_grouped_df['Lifetime_Cluster'] = bankaya_grouped_df['Lifetime_Cluster'].map(cluster_map)
bankaya_grouped_df['Lifetime_Cluster'] = bankaya_grouped_df['Lifetime_Cluster'].replace({'Cluster_1': 'Long_Tenure',
                                                                                             'Cluster_0': 'Medium_Tenure',
                                                                                             'Cluster_2': 'Short_Tenure'})


# ---Show the cluster summary and some sample rows
print(cluster_summary)

bankaya_grouped_df.sample(5)
   Lifetime_Cluster        mean  count
1                 1   70.579334   4355
0                 0  261.493224   3173
2                 2  516.891342   1721
Out[52]:
customer_id target customer_lifetime n_sf_apps prevention_key_nunique total_credits loan_life_days_mean max_credit total_payments_sum n_inquiries_l3m_mean payment_amount payment_to_due_ratio time_to_loan_days_mean credit_limit n_bnpl_approved_apps_mean outstanding_balance n_inquiries_l6m_mean n_bnpl_apps max_days_late last_payment_days delayed_payments current_balance account_type_credito de habilitacion o avio_sum account_type_hipoteca_sum account_type_pagos fijos_sum account_type_por determinar_sum account_type_revolvente_sum account_type_sin limite preestablecido_sum currency_dolares_sum currency_pesos mexicanos_sum currency_unidades de inversion_sum responsability_type_aval_sum responsability_type_individual (titular)_sum responsability_type_mancomunado_sum responsability_type_obligatorio solidario_sum responsability_type_titular con aval_sum prevention_key_risk_score gen_credit_type gen_business_type payment_frequency Lifetime_Cluster
3535 4290 1.0 190 28.0 3 28 44.923077 47222.0 102.0 0.0 15731.0 2.977500 2.0 17106.0 0.0 601.571429 0.0 56.0 1736 964.0 177 30148.0 0 0 21 0 3 4 0 28 0 0 28 0 0 0 1.321429 0.168919 0.166691 0.163791 Medium_Tenure
8470 11758 0.0 8 13.0 4 13 217.750000 71356.0 149.0 16.0 3377.0 10.236154 13.0 24469.0 0.0 3162.000000 16.0 13.0 13 1008.0 49 19089.0 0 0 7 0 0 6 0 13 0 0 13 0 0 0 3.076923 0.168947 0.175107 0.166437 Long_Tenure
3862 4692 0.0 251 4.0 2 4 NaN 64351.0 906.0 0.0 86493.0 9.342500 8.0 51500.0 1.0 11773.000000 0.0 8.0 24 74.0 25 98572.0 0 0 1 0 3 0 0 4 0 0 4 0 0 0 3.000000 0.173848 0.166270 0.166254 Medium_Tenure
9019 13611 0.0 0 0.0 7 60 143.975610 225496.0 313.0 70.0 244416.0 0.696000 5.0 166976.0 NaN 4564.859649 70.0 0.0 0 3992.0 218 253137.0 0 0 49 0 9 2 0 60 0 0 60 0 0 0 2.016667 0.168732 0.168494 0.166905 Long_Tenure
5797 7404 0.0 204 0.0 4 21 145.750000 192926.0 1162.0 0.0 21165.0 0.341905 3.0 147853.0 0.0 5801.047619 0.0 21.0 0 2362.0 52 26031.0 0 0 17 0 4 0 0 21 0 0 21 0 0 0 2.047619 0.167250 0.167660 0.167200 Medium_Tenure
  • Cluster_2 (Short Tenure): With an average tenure of about 71 days, these are relatively new customers. They might need more educational content and initial offers to fully engage with the services.

  • Cluster_0 (Medium Tenure): With an average tenure of about 261 days, these customers have been engaged for a reasonable time. They could be targeted for up-sell or cross-sell opportunities.

  • Cluster_1 (Long Tenure): With an average tenure of about 517 days, these are the most loyal customers who might be more open to premium services or loyalty programs.

    The average customer tenure is around 219 days, but there's a wide range, up to 901 days. This indicates that while some customers are relatively new, others have been engaged for a longer time. Understanding customer tenure can help in segmenting the customer base for targeted marketing. For example, longer-tenured customers might be more receptive to cross-selling or up-selling initiatives, while shorter-tenured customers might benefit from welcome offers or educational content.


RFM Analysis¶

First, I will identify the key columns I need for RFM analysis:

  • Recency will be based on 'LAST_PAYMENT_DATE'.
  • Frequency will be based on the number of applications by 'customer_id'
  • Monetary will be based on 'CURRENT_BALANCE'
In [53]:
# Calculate Recency, Frequency, and Monetary value for each customer
# Recency
recency_df = bankaya_final_model_df.groupby('customer_id')['last_payment_date'].max().reset_index()
recency_df.columns = ['customer_id', 'Recent_Payment_Date']
recency_df['Recency'] = (bankaya_final_model_df['last_payment_date'].max() - recency_df['Recent_Payment_Date']).dt.days

# Frequency
frequency_df = bankaya_final_model_df.groupby('customer_id')['application_datetime'].count().reset_index()
frequency_df.columns = ['customer_id', 'Frequency']

# Monetary
monetary_df = bankaya_final_model_df.groupby('customer_id')['current_balance'].mean().reset_index()
monetary_df.columns = ['customer_id', 'Monetary']

# Merge these dataframes to create the RFM table
rfm_df = pd.merge(recency_df, frequency_df, on='customer_id')
rfm_df = pd.merge(rfm_df, monetary_df, on='customer_id')

# Show some statistics and sample rows for the RFM features
rfm_stats = rfm_df[['Recency', 'Frequency', 'Monetary']].describe()
rfm_sample = rfm_df.sample(5)

rfm_stats
Out[53]:
Recency Frequency Monetary
count 9145.000000 9249.000000 9172.000000
mean 463.319956 30.839766 6852.744311
std 424.451785 30.582474 14799.353291
min 0.000000 1.000000 0.000000
25% 249.000000 10.000000 1223.190000
50% 374.000000 21.000000 2614.046296
75% 542.000000 41.000000 5960.183279
max 6474.000000 269.000000 481599.500000
In [54]:
rfm_sample
Out[54]:
customer_id Recent_Payment_Date Recency Frequency Monetary
7671 10605 2022-10-17 202.0 28 1878.642857
73 90 2022-06-10 331.0 91 432.857143
8602 12060 2022-12-19 139.0 19 2620.000000
8324 11564 2022-11-18 170.0 9 226.555556
8871 13084 2023-01-09 118.0 139 3130.467626
  • Recency: A high average recency of 463 days suggests that many customers have not interacted with the services for a significant amount of time. Retention campaigns might be needed.

  • Frequency: The average frequency is 31, indicating that customers tend to apply for loans or other services multiple times. Cross-selling or up-selling strategies could be effective.

  • Monetary: The high variation in the monetary value, ranging from $0 to $481,600, suggests a diverse customer base in terms of value. Segmenting customers based on this could help in personalized marketing.

  • High-Value Customers: Customers with low recency, high frequency, and high monetary values are generally the most valuable. Focused retention strategies can be designed for this segment.

In [55]:
# --- Further segmenting the RFM scores using quantiles
# For Recency, lower the value, better the segment. Hence, labels are in reverse order
rfm_df['R_Segment'] = pd.qcut(rfm_df['Recency'], 4, labels=['4', '3', '2', '1'])

# --- For Frequency and Monetary, higher the value, better the segment
rfm_df['F_Segment'] = pd.qcut(rfm_df['Frequency'], 4, labels=['1', '2', '3', '4'])
rfm_df['M_Segment'] = pd.qcut(rfm_df['Monetary'], 4, labels=['1', '2', '3', '4'])

# --- Combine the individual segments to create a combined RFM score
rfm_df['RFM_Score'] = rfm_df['R_Segment'].astype(str) + rfm_df['F_Segment'].astype(str) + rfm_df['M_Segment'].astype(str)

# --- -Define the RFM level function
def rfm_level(df):
    if df['RFM_Score'] == '444':
        return 'Champion'
    elif df['F_Segment'] == '4' and df['M_Segment'] == '4':
        return 'Loyal Customer'
    elif df['R_Segment'] == '1':
        return 'At Risk'
    elif df['R_Segment'] == '2' and df['F_Segment'] == '1':
        return 'Needs Attention'
    else:
        return 'Average'

# --- Calculate RFM Level
rfm_df['RFM_Level'] = rfm_df.apply(rfm_level, axis=1)

# --- Show the distribution of RFM Level and some sample rows
rfm_level_summary = rfm_df['RFM_Level'].value_counts().reset_index()
rfm_level_summary.columns = ['RFM_Level', 'Count']
rfm_sample_segmented = rfm_df

rfm_level_summary
Out[55]:
RFM_Level Count
0 Average 6168
1 At Risk 2222
2 Needs Attention 510
3 Loyal Customer 222
4 Champion 127
In [56]:
# --- Display the RFM scores. I will use these scores to included in the model
rfm_sample_segmented
Out[56]:
customer_id Recent_Payment_Date Recency Frequency Monetary R_Segment F_Segment M_Segment RFM_Score RFM_Level
0 1 2022-05-27 345.0 8 2246.000000 3 1 2 312 Average
1 2 2022-04-16 386.0 10 2577.900000 2 1 2 212 Needs Attention
2 3 2022-09-07 242.0 32 14773.718750 4 3 4 434 Average
3 4 2022-01-03 489.0 13 3447.923077 2 2 3 223 Average
4 5 2021-09-24 590.0 49 12025.612245 1 4 4 144 Loyal Customer
... ... ... ... ... ... ... ... ... ... ...
9244 14391 2023-03-01 67.0 26 541.576923 4 3 1 431 Average
9245 14404 2023-04-10 27.0 26 2609.807692 4 3 2 432 Average
9246 14408 2021-06-30 676.0 4 7764.250000 1 1 4 114 At Risk
9247 14410 2023-03-22 46.0 156 183.089744 4 4 1 441 Average
9248 14416 2023-01-12 115.0 28 5381.107143 4 3 3 433 Average

9249 rows × 10 columns

  • Champion: These are your best customers who are recent, frequent, and high spenders. Focus on retaining them and offering them personalized services.

  • Loyal Customer: These customers are frequent and high spenders but might not be as recent. Keep them engaged through loyalty programs or personalized offers.

  • At Risk: These are customers who have not interacted recently but have a history of frequent interactions and spending. Retargeting campaigns can bring them back.

  • Needs Attention: These customers have not interacted recently and have low frequency and monetary value. They require nurturing to move up the value chain.

  • Average: These customers fall in between all these categories. Regular marketing campaigns could be effective for this group.


Data Pipeline Tranformation¶

In [57]:
## --- Enhancing the DataFrame with RFM Scores for Modeling:
bankaya_model_merged_df = bankaya_grouped_df.merge(rfm_df[['customer_id', 'RFM_Score']], on='customer_id', how='left')

# Replace 'nan' strings with actual NaN values (np.nan)
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].replace('nan', np.nan, regex=True)

# Now I can fill NaN values with a placeholder, or drop them, etc.
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].fillna(0)

# Attempt conversion to integer
bankaya_model_merged_df['RFM_Score'] = bankaya_model_merged_df['RFM_Score'].astype(int)

bankaya_model_merged_df.sample(5)
Out[57]:
customer_id target customer_lifetime n_sf_apps prevention_key_nunique total_credits loan_life_days_mean max_credit total_payments_sum n_inquiries_l3m_mean payment_amount payment_to_due_ratio time_to_loan_days_mean credit_limit n_bnpl_approved_apps_mean outstanding_balance n_inquiries_l6m_mean n_bnpl_apps max_days_late last_payment_days delayed_payments current_balance account_type_credito de habilitacion o avio_sum account_type_hipoteca_sum account_type_pagos fijos_sum account_type_por determinar_sum account_type_revolvente_sum account_type_sin limite preestablecido_sum currency_dolares_sum currency_pesos mexicanos_sum currency_unidades de inversion_sum responsability_type_aval_sum responsability_type_individual (titular)_sum responsability_type_mancomunado_sum responsability_type_obligatorio solidario_sum responsability_type_titular con aval_sum prevention_key_risk_score gen_credit_type gen_business_type payment_frequency Lifetime_Cluster RFM_Score
1953 2269 0.0 538 242.0 5 121 121.812500 514592.0 754.0 0.0 196207.0 0.791488 6.0 286582.0 0.0 2498.965217 0.0 121.0 3267 14090.0 285 246840.0 0 0 73 37 7 4 0 121 0 0 82 1 0 38 1.685950 0.166163 0.164210 0.168463 Short_Tenure 142
2918 3525 0.0 243 19.0 2 19 40.166667 98732.0 241.0 0.0 32347.0 1.493158 4.0 23265.0 0.0 2709.368421 0.0 19.0 228 2043.0 178 33943.0 0 0 6 10 1 2 0 19 0 0 19 0 0 0 2.263158 0.166774 0.161613 0.168102 Medium_Tenure 122
6528 8768 1.0 143 1.0 1 1 252.000000 5213.0 16.0 0.0 3558.0 1.000000 30.0 0.0 0.0 3558.000000 8.0 1.0 42 252.0 9 3558.0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 9.000000 0.166172 0.169220 0.170136 Long_Tenure 213
2842 3429 0.0 318 114.0 4 57 164.217391 765839.0 1426.0 0.0 139123.0 5.715088 23.0 474741.0 1.0 1334.272727 0.0 57.0 1596 5605.0 277 503648.0 0 2 53 0 1 1 0 57 0 0 55 0 2 0 2.561404 0.167614 0.166786 0.167638 Medium_Tenure 244
4817 5967 1.0 64 0.0 4 29 185.454545 103264.0 390.0 26.0 21002.0 0.922069 107.0 30888.0 0.0 1632.827586 26.0 29.0 1798 3987.0 34 23588.0 0 0 25 0 1 3 0 29 0 0 29 0 0 0 1.793103 0.167236 0.171364 0.167094 Long_Tenure 331
In [58]:
# --- Manual mappings to assign the values to each category
lifetime_cluster_mapping = {
    'Short_Tenure': 1,
    'Medium_Tenure': 2,
    'Long_Tenure': 3,
}
# --- Mapping the lifetime cluster and the RFM importance with the scores given. 
bankaya_model_merged_df['Lifetime_Cluster_score'] = bankaya_model_merged_df['Lifetime_Cluster'].map(lifetime_cluster_mapping)

# --- Drop unnecessary columns 
bankaya_model_merged_df.drop("Lifetime_Cluster",inplace=True,axis=1)
In [59]:
# <<<< Function to create the pipeline for the model >>>>
def transform_data_test():

    continuous_pipeline = Pipeline([
        ("imputer",SimpleImputer(strategy="mean")),
        ("scaler",RobustScaler(with_centering=True,with_scaling=True))

    ])

    preproccessing_pipeline = ColumnTransformer([
        ("continous",continuous_pipeline,model_numerical_vars)

    ],
        remainder = "drop"
    )

    pipeline = ImbPipeline(steps=([
        ("preprocess",preproccessing_pipeline),
        ("smote", SMOTE(random_state=42)) 
        
        
    ]))
    return pipeline

Model Development¶

To prevent data leakage, it's crucial to separate the dataset into training and testing sets before applying any transformations. For a streamlined and automated transformation process, pipelines have been implemented for numerical variables.
In [60]:
#!pip install imbalanced-learn
In [61]:
# --- Identify all numerical variables in the dataset for the modeling pprocesing. 
model_numerical_vars = [col for col in bankaya_model_merged_df.columns if 'target'not in col and 'customer_id' not in col ]
In [62]:
# --- Pipeline process to be applied to the X_train and X_test
pipeline_data = transform_data_test()
pipeline_data
Out[62]:
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('continous',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   RobustScaler())]),
                                                  ['customer_lifetime',
                                                   'n_sf_apps',
                                                   'prevention_key_nunique',
                                                   'total_credits',
                                                   'loan_life_days_mean',
                                                   'max_credit',
                                                   'total_payments_sum',
                                                   'n_inquiries_l3m_mean',
                                                   'payment_amount',
                                                   'payment_to_due_ratio',
                                                   'time_to_l...
                                                   'habilitacion o avio_sum',
                                                   'account_type_hipoteca_sum',
                                                   'account_type_pagos '
                                                   'fijos_sum',
                                                   'account_type_por '
                                                   'determinar_sum',
                                                   'account_type_revolvente_sum',
                                                   'account_type_sin limite '
                                                   'preestablecido_sum',
                                                   'currency_dolares_sum',
                                                   'currency_pesos '
                                                   'mexicanos_sum',
                                                   'currency_unidades de '
                                                   'inversion_sum',
                                                   'responsability_type_aval_sum', ...])])),
                ('smote', SMOTE(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('continous',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   RobustScaler())]),
                                                  ['customer_lifetime',
                                                   'n_sf_apps',
                                                   'prevention_key_nunique',
                                                   'total_credits',
                                                   'loan_life_days_mean',
                                                   'max_credit',
                                                   'total_payments_sum',
                                                   'n_inquiries_l3m_mean',
                                                   'payment_amount',
                                                   'payment_to_due_ratio',
                                                   'time_to_l...
                                                   'habilitacion o avio_sum',
                                                   'account_type_hipoteca_sum',
                                                   'account_type_pagos '
                                                   'fijos_sum',
                                                   'account_type_por '
                                                   'determinar_sum',
                                                   'account_type_revolvente_sum',
                                                   'account_type_sin limite '
                                                   'preestablecido_sum',
                                                   'currency_dolares_sum',
                                                   'currency_pesos '
                                                   'mexicanos_sum',
                                                   'currency_unidades de '
                                                   'inversion_sum',
                                                   'responsability_type_aval_sum', ...])])),
                ('smote', SMOTE(random_state=42))])
ColumnTransformer(transformers=[('continous',
                                 Pipeline(steps=[('imputer', SimpleImputer()),
                                                 ('scaler', RobustScaler())]),
                                 ['customer_lifetime', 'n_sf_apps',
                                  'prevention_key_nunique', 'total_credits',
                                  'loan_life_days_mean', 'max_credit',
                                  'total_payments_sum', 'n_inquiries_l3m_mean',
                                  'payment_amount', 'payment_to_due_ratio',
                                  'time_to_loan_days_mean', 'credit_limit',...
                                  'current_balance',
                                  'account_type_credito de habilitacion o '
                                  'avio_sum',
                                  'account_type_hipoteca_sum',
                                  'account_type_pagos fijos_sum',
                                  'account_type_por determinar_sum',
                                  'account_type_revolvente_sum',
                                  'account_type_sin limite preestablecido_sum',
                                  'currency_dolares_sum',
                                  'currency_pesos mexicanos_sum',
                                  'currency_unidades de inversion_sum',
                                  'responsability_type_aval_sum', ...])])
['customer_lifetime', 'n_sf_apps', 'prevention_key_nunique', 'total_credits', 'loan_life_days_mean', 'max_credit', 'total_payments_sum', 'n_inquiries_l3m_mean', 'payment_amount', 'payment_to_due_ratio', 'time_to_loan_days_mean', 'credit_limit', 'n_bnpl_approved_apps_mean', 'outstanding_balance', 'n_inquiries_l6m_mean', 'n_bnpl_apps', 'max_days_late', 'last_payment_days', 'delayed_payments', 'current_balance', 'account_type_credito de habilitacion o avio_sum', 'account_type_hipoteca_sum', 'account_type_pagos fijos_sum', 'account_type_por determinar_sum', 'account_type_revolvente_sum', 'account_type_sin limite preestablecido_sum', 'currency_dolares_sum', 'currency_pesos mexicanos_sum', 'currency_unidades de inversion_sum', 'responsability_type_aval_sum', 'responsability_type_individual (titular)_sum', 'responsability_type_mancomunado_sum', 'responsability_type_obligatorio solidario_sum', 'responsability_type_titular con aval_sum', 'prevention_key_risk_score', 'gen_credit_type', 'gen_business_type', 'payment_frequency', 'RFM_Score', 'Lifetime_Cluster_score']
SimpleImputer()
RobustScaler()
SMOTE(random_state=42)
In [63]:
# <<<< Búsqueda de hiperparametros de cada modelo >>>>

classifiers = {
    
    'Bagging': BaggingClassifier(random_state=314),
    'RF': RandomForestClassifier(random_state=314,n_jobs=-1),
    'AdaBoost': AdaBoostClassifier(base_estimator = DecisionTreeClassifier(max_depth=2, random_state=314),random_state=314),
    'GBoosting': GradientBoostingClassifier(random_state=314),
    "XGBoost": xgb.XGBClassifier(random_state = 314),
    'Logistic_r':LogisticRegression(random_state = 314),
    'D_tree':DecisionTreeClassifier(criterion='gini', random_state=314)
}

params = {
          
            'RF':{'n_estimators':range(1,11),
                  'max_depth': range(1,11),
                  "min_samples_split":[2,4]
                 },
    
            'AdaBoost': {'learning_rate':[1e-5, 1e-4,1e-3, 1e-2, 1e-1, 1.0],
                         'n_estimators':range(1,11)
                         },
    
            'GBoosting': {'max_depth':range(1,11),
                          'n_estimators':range(1,11),
                          'learning_rate':[1e-3, 1e-2, 1e-1,1.0]
                         },
    
            'XGBoost': {'max_depth':range(1,11),
                        'n_estimators':range(1,11),
                        "learning_rate":[1e-3, 1e-2, 1e-1,1.0],
                        'eval_metric': ['logloss']
                       },
    
            'Logistic_r': {'penalty' : ['l1'],
                           'C': [0.01, 0.03,0.04,0.05,0.06],
                           'solver': ['saga'],
                           "max_iter":[5000,10000]
                           },
    
            'D_tree': { 'max_depth': [2,4,5,6,7,8],
                        'min_samples_leaf': [1, 5, 10, 20, 35],
                        'min_samples_split': [2, 15, 30, 60],
                       }
        }
In [64]:
#--- Split DataFrame into train and test
y_bankaya_model = bankaya_model_merged_df["target"]
X_bankaya_model = bankaya_model_merged_df.drop(["target","customer_id"],axis=1) # --- Remove unneccessary columns
In [65]:
# --- Split model into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X_bankaya_model,y_bankaya_model, test_size=0.2, random_state=42, stratify=y_bankaya_model)

# Replace infinite values with NaN
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)
In [66]:
print("Number transactions X_train dataset: ", X_train.shape)
print("Number transactions y_train dataset: ", y_train.shape)
print("Number transactions X_test dataset: ", X_test.shape)
print("Number transactions y_test dataset: ", y_test.shape)
print("\n========Variables objetivo dataset Train===========\n")
print("Before OverSampling, counts of target '1': {}".format(sum(y_train==1)))
print("Before OverSampling, counts of target '0': {} \n".format(sum(y_train==0)))
Number transactions X_train dataset:  (7399, 40)
Number transactions y_train dataset:  (7399,)
Number transactions X_test dataset:  (1850, 40)
Number transactions y_test dataset:  (1850,)

========Variables objetivo dataset Train===========

Before OverSampling, counts of target '1': 1380
Before OverSampling, counts of target '0': 6019 

Trainig and Test Oversampling SMOTE¶

In [67]:
# --- Use fit_resample method just in train to apply the resample SMOTE method
X_train_resampled, y_train_resampled = pipeline_data.fit_resample(X_train, y_train) 

# First, get the preprocessing and transformation steps without the SMOTE step
preprocess_pipeline = pipeline_data[:-1]

# --- Apply the transform method to the test set using the obtained preprocessing pipeline
X_test_scaled = preprocess_pipeline.transform(X_test) # apply transform just in test
In [68]:
# Convert the scaled and transformed X_train back to a DataFrame
X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=model_numerical_vars)

# Convert the scaled and transformed X_test back to a DataFrame
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=model_numerical_vars)
In [69]:
print('After SMOTE, the shape of X_train_under: {}'.format(X_train_resampled.shape))
print('After SMOTE, the shape of y_train_under: {} \n'.format(y_train_resampled.shape))

print("After SMOTE, counts of label '1': {}".format(sum(y_train_resampled==1)))
print("After SMOTE, counts of label '0': {}".format(sum(y_train_resampled==0)))
After SMOTE, the shape of X_train_under: (12038, 40)
After SMOTE, the shape of y_train_under: (12038,) 

After SMOTE, counts of label '1': 6019
After SMOTE, counts of label '0': 6019
In [70]:
import warnings
warnings.filterwarnings("ignore")
from sklearn.experimental import enable_halving_search_cv  
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.experimental import enable_halving_search_cv

Model Selection¶

In [71]:
# Create an empety dictionary where the model will be stored
clasifier_dict = {} 
In [72]:
# < ===== GridSeacrh del modelo XGBoost ====== >

for name in list(params.keys()):
    if name == "XGBoost":
        print("================>")
        print(name)
        print("================>")
        
        clasifier_dict[name] = GridSearchCV(classifiers[name],
                                            params[name],
                                            #param_grid= pipeline["classifier"].available_models[name],
                                            scoring='roc_auc', 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
   
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
XGBoost
================>
Fitting 3 folds for each of 400 candidates, totalling 1200 fits
In [73]:
# < ======= GridSeacrh del modelo Logistic regression ========= >

for name in list(params.keys()):
    if name == "Logistic_r":
        print("================>")
        print(name)
        print("================>")
        clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
                                            param_grid=params[name],
                                            #param_grid= pipeline["classifier"].available_models[name],
                                            scoring="roc_auc", 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
Logistic_r
================>
Fitting 3 folds for each of 10 candidates, totalling 30 fits
In [74]:
# < ======= GridSeacrh del modelo GBoosting ========= >

for name in list(params.keys()):
    if name == "GBoosting":
        print("================>")
        print(name)
        print("================>")
        clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
                                            param_grid=params[name],
                                            #param_grid= pipeline["classifier"].available_models[name],
                                            scoring="roc_auc", 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
GBoosting
================>
Fitting 3 folds for each of 400 candidates, totalling 1200 fits
In [86]:
# < ======= GridSeacrh del modelo AdaBoost ========= >

for name in list(params.keys()):
    if name == "AdaBoost":
        print("================>")
        print(name)
        print("================>")
        clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
                                            param_grid=params[name],
                                            scoring='roc_auc', 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
In [76]:
# < ======= GridSeacrh del modelo Random Forest ========= >

for name in list(params.keys()):
    if name == "RF":
        print("================>")
        print(name)
        print("================>")
        clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
                                            param_grid=params[name],
                                            scoring='roc_auc', 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
RF
================>
Fitting 3 folds for each of 200 candidates, totalling 600 fits
In [77]:
# < ======= GridSeacrh del modelo Decision Tree ========= >

for name in list(params.keys()):
    if name == "D_tree":
        print("================>")
        print(name)
        print("================>")
        clasifier_dict[name] = GridSearchCV(estimator= classifiers[name],
                                            param_grid=params[name],
                                            scoring='roc_auc', 
                                            return_train_score=True, 
                                            n_jobs=-1, cv=3, verbose=1)
        
        clasifier_dict[name].fit(X_train_resampled, y_train_resampled)
================>
D_tree
================>
Fitting 3 folds for each of 120 candidates, totalling 360 fits
In [78]:
clasifier_dict
Out[78]:
{'XGBoost': GridSearchCV(cv=3,
              estimator=XGBClassifier(base_score=None, booster=None,
                                      callbacks=None, colsample_bylevel=None,
                                      colsample_bynode=None,
                                      colsample_bytree=None, device=None,
                                      early_stopping_rounds=None,
                                      enable_categorical=False, eval_metric=None,
                                      feature_types=None, gamma=None,
                                      grow_policy=None, importance_type=None,
                                      interaction_constraints=None,
                                      learning_rate=None,...
                                      max_leaves=None, min_child_weight=None,
                                      missing=nan, monotone_constraints=None,
                                      multi_strategy=None, n_estimators=None,
                                      n_jobs=None, num_parallel_tree=None,
                                      random_state=314, ...),
              n_jobs=-1,
              param_grid={'eval_metric': ['logloss'],
                          'learning_rate': [0.001, 0.01, 0.1, 1.0],
                          'max_depth': range(1, 11),
                          'n_estimators': range(1, 11)},
              return_train_score=True, scoring='roc_auc', verbose=1),
 'Logistic_r': GridSearchCV(cv=3, estimator=LogisticRegression(random_state=314), n_jobs=-1,
              param_grid={'C': [0.01, 0.03, 0.04, 0.05, 0.06],
                          'max_iter': [5000, 10000], 'penalty': ['l1'],
                          'solver': ['saga']},
              return_train_score=True, scoring='roc_auc', verbose=1),
 'GBoosting': GridSearchCV(cv=3, estimator=GradientBoostingClassifier(random_state=314),
              n_jobs=-1,
              param_grid={'learning_rate': [0.001, 0.01, 0.1, 1.0],
                          'max_depth': range(1, 11),
                          'n_estimators': range(1, 11)},
              return_train_score=True, scoring='roc_auc', verbose=1),
 'AdaBoost': GridSearchCV(cv=3,
              estimator=AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
                                                                                 random_state=314),
                                           random_state=314),
              n_jobs=-1,
              param_grid={'learning_rate': [1e-05, 0.0001, 0.001, 0.01, 0.1,
                                            1.0],
                          'n_estimators': range(1, 11)},
              return_train_score=True, scoring='roc_auc', verbose=1),
 'RF': GridSearchCV(cv=3,
              estimator=RandomForestClassifier(n_jobs=-1, random_state=314),
              n_jobs=-1,
              param_grid={'max_depth': range(1, 11), 'min_samples_split': [2, 4],
                          'n_estimators': range(1, 11)},
              return_train_score=True, scoring='roc_auc', verbose=1),
 'D_tree': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=314),
              n_jobs=-1,
              param_grid={'max_depth': [2, 4, 5, 6, 7, 8],
                          'min_samples_leaf': [1, 5, 10, 20, 35],
                          'min_samples_split': [2, 15, 30, 60]},
              return_train_score=True, scoring='roc_auc', verbose=1)}
In [79]:
# ======= Results of the Model  ==============
test_scores = []

for name in list(params.keys()):
    try:
        print(name)
        print("===========>")
        print("Number of configurations: {}\n".format(pd.DataFrame.from_dict(clasifier_dict[name].cv_results_).shape[0]))
        aux = pd.DataFrame.from_dict(clasifier_dict[name].cv_results_).sort_values(by="mean_test_score",ascending=False).iloc[0]
        roc_cv_test = aux['mean_test_score']
        roc_cv_train = aux['mean_train_score']

        roc_train = roc_auc_score(y_train_resampled, clasifier_dict[name].predict(X_train_resampled))
        roc_test = roc_auc_score(y_test, clasifier_dict[name].predict(X_test_scaled))

        test_scores.append((name,roc_train,roc_cv_train,roc_cv_test,roc_test))
        
    except:
        print("Model {} is not store in the database\n".format(name))
RF
===========>
Number of configurations: 200

AdaBoost
===========>
Number of configurations: 60

GBoosting
===========>
Number of configurations: 400

XGBoost
===========>
Number of configurations: 400

Logistic_r
===========>
Number of configurations: 10

D_tree
===========>
Number of configurations: 120


Model Evaluation¶

In [80]:
#Performance of the Models

resultado = pd.DataFrame(test_scores,columns=['Classifier','AUC_train','AUC_CV_train','AUC_validation','AUC_test'])

print("\n=========== Performance Table ============")
display(resultado)

best_clas = resultado.sort_values(by="AUC_validation",ascending=False).iloc[0]['Classifier']

print("El mejor clasificador es: {}".format(best_clas))
=========== Performance Table ============
Classifier AUC_train AUC_CV_train AUC_validation AUC_test
0 RF 0.982555 0.999646 0.995155 0.940994
1 AdaBoost 0.987539 0.999339 0.998176 0.979315
2 GBoosting 1.000000 1.000000 0.999153 0.994111
3 XGBoost 1.000000 1.000000 0.999783 0.998671
4 Logistic_r 0.963449 0.990104 0.989697 0.951683
5 D_tree 0.990281 0.999219 0.994379 0.975179
El mejor clasificador es: XGBoost

Classifiers:

  1. RF: Random Forest
  2. AdaBoost: Adaptive Boosting
  3. XGBoost: Extreme Gradient Boosting
  4. Logistic_r: Logistic Regression
  5. D_tree: Decision Tree

Datasets:

  1. AUC_train: AUC on the training dataset.
  2. AUC_CV_train: AUC on the cross-validation set derived from the training data.
  3. AUC_validation: AUC on a separate validation dataset.
  4. AUC_test: AUC on the test dataset.

AUC (Area Under the Curve): An AUC score of 1 indicates perfect classification, while 0.5 suggests no discriminative power (equivalent to random guessing).

Insights:

  • Overfitting Check: Comparing AUC_train with AUC_test helps to identify potential overfitting. For instance, if AUC_train is very high but AUC_test is significantly lower, the model might be overfitting the training data. The XGBoost classifier has perfect scores on training and cross-validation, which could suggest potential overfitting. However, its high AUC_test indicates that it still performs exceptionally well on unseen data.

  • Cross-Validation Consistency: AUC_CV_train provides a measure of the model's performance on different subsets of the training data. If AUC_CV_train is close to AUC_train, it suggests the model is stable across different subsets of the training data.

  • Recommendation: Given the results, XGBoost seems to be the best model in terms of AUC across all datasets. However, due to its perfect scores on training and cross-validation, it's essential to ensure it's not overfitting through further analysis, like checking feature importances or using regularization.
In [81]:
# --- Get the best model
best_model = clasifier_dict["XGBoost"].best_estimator_

# --- Get feature importances from the best model
feature_importances = best_model.feature_importances_
In [82]:
# >>>> Confusion Matrix XGBoost <<<<

y_target = y_test
y_predicted = clasifier_dict["XGBoost"].predict(X_test_scaled)

cm = confusion_matrix(y_target,
                      y_predicted)

fig, ax = plot_confusion_matrix(conf_mat=cm)
plt.show()

precision = precision_score(y_target, y_predicted)
recall = recall_score(y_target, y_predicted)
F1=f1_score(y_target, y_predicted)

print('Precision: {:.2f} \nRecall {:.2f}\nF1 score {:.2f}'.format(precision *100, recall * 100,F1 * 100 ))
Precision: 98.85 
Recall 100.00
F1 score 99.42
  • High Precision: The model is extremely good at ensuring that a majority of the customers it predicts as risky (predicted label: 1) are actually risky (true label: 1). This means that only a very small fraction of good customers are incorrectly labeled as risky. This will help the bank avoid giving loans to potentially risky customers.

  • Perfect Recall: A recall of 100% means that the model identifies all the risky customers (true label 1). No risky customer will go unnoticed, which is crucial for Bankaya as it reduces the potential for bad loans.

  • High F1 Score: An F1 score of 99.42% is an indication that there's a good balance between precision and recall. This further solidifies the model's reliability in differentiating between good and risky customers.

  • Minimal False Positives: The model only wrongly predicts 4 good customers as risky out of 1505 good customers. This is a very low number and indicates that the chance of a good customer being rejected is minimal.

  • No False Negatives: The fact that there are no false negatives (0 customers wrongly predicted as good when they are risky) is excellent for Bankaya. It means they won't be approving loans to customers who are likely to default.


Model Interpretation¶

Feature Importance¶

In [83]:
import pandas as pd
import plotly.express as px

# I'm assuming 'model_numerical_vars' and 'feature_importances' are defined elsewhere in your code
df_importance = pd.DataFrame({
    'Feature': model_numerical_vars,
    'Importance': feature_importances
})

df_importance = df_importance.sort_values('Importance', ascending=True)

fig = px.bar(df_importance, x='Importance', y='Feature', orientation='h', 
             title='Feature Importance XGBoost', labels={'Importance': 'Importance Score'})

# Set the figure size
fig.update_layout(
    autosize=False,
    width=1000,    # You can adjust the width to your preference
    height=800,    # You can adjust the height to make sure all labels are visible
    margin=dict(
        l=50,  # left margin, increase if your y labels are getting cut off
        r=50,  # right margin
        b=100, # bottom margin
        t=100, # top margin
    )
)

# You can also modify the font size of the labels if necessary
fig.update_layout(
    title_font_size=20,
    font=dict(
        family="Arial, Balto, Courier New, Droid Sans",  # Choose a font family
        size=12,  # Set the global font size
        color="RebeccaPurple"  # Set the font color
    )
)
  1. Most Important Features:
  • max_days_late: This is by far the most significant feature. The number of days a customer has been late in payments in the past can be a strong indicator of their future reliability. Customers with a high number of max days late are potentially more risky.
  • total_credits and last_payment_days: These features are also crucial. The total number of credits a customer has and the number of days since their last payment can indicate their financial health and payment behavior.
  1. Credit Related Features:
  • titular_con_avl_sum, gen_cred_type, and prevention_key_risk_score: Their importance indicates that the type of credit and risk scores associated with the applicant's previous engagements play a role in determining their reliability.
  1. Application History:
  • Features like n_sf_apps, n_bnp_apps, and n_bnp_approved_apps_mean likely pertain to the number of loan applications . A high number of applications, especially if not approved, might indicate desperation or financial instability.

Results and Recommendations¶

Conclusions:

Credit Approval Criteria:

  • The average max days late, high number of inquiries, and current balance are strong indicators of potential credit risk. Customers with a short time between account creation and loan application do not necessarily present a higher risk, which challenges the assumption that such behavior is impulsive and therefore risky. Customers with longer tenure appear to be more reliable with fewer late payments, which suggests a relationship between customer engagement and creditworthiness.

Customer Segmentation:

  • The data reveals distinct customer behaviors across different clusters (short, medium, long tenure). This segmentation can be leveraged for tailored marketing, education, and product offerings. The presence of a significant proportion of customers with zero payments and zero credit limit indicates a segment that either has no need for credit or is potentially underbanked.

Portfolio Health:

  • The presence of fixed payment schemes in the financial services sector ("PAGOS FIJOS") with high fixed payment latencies indicates a risk concentration that should be addressed. The variations in payment amounts and credit limits suggest a need for personalized credit management.

Recommendations:

Risk Mitigation:

  • Implement a tiered credit approval process where customers with indicators of higher risk (such as high max days late, a high number of inquiries, or high current balances) are subjected to more rigorous scrutiny or are offered products with different terms (e.g., lower credit limits or higher interest rates to offset risk).

  • Develop targeted financial education programs for new customers, especially in clusters with shorter tenures, to encourage financial responsibility.

Customer Engagement:

  • Design engagement and retention campaigns around the 6-7 month mark, which is the average time customers take from account creation to loan application. Develop loyalty programs and premium services for long-tenure customers to capitalize on their demonstrated loyalty.

  • Past Behavior is Crucial: Features related to past payment behavior, like max_days_late and last_payment_days, are vital. Bankaya should closely evaluate the payment history of applicants.

  • Diverse Financial Footprint: The variety of features from account types to credit types indicates that customers with a diverse financial footprint (various accounts, credits) can provide more data points to assess their reliability.

  • Application History: It's essential to consider how often a customer has applied for loans and how many have been approved. A high number of rejected applications might be a red flag.

Model Deployment and Monitoring:

  • With the XGBoost model performing well in differentiating between good and risky customers, deploy the model while setting up a monitoring system to continuously track its performance and catch any signs of overfitting.
  • Regularly update the model with new data and conduct feature importance analysis to understand the changing dynamics of customer risk profiles.

False Positives and Negatives:

  • Although the model has a low rate of false positives and no false negatives, continuous monitoring is important to maintain this performance as new data comes in, and to ensure that the bank is not rejecting potential good customers or accepting risky ones.
In [ ]: